30 days of Data Engineering: Day 4

Sarang Surve

13 min read · Apr 1, 2023

Welcome back, peeps!

The prerequisite for Day 4 is to complete the first three days' posts (links below):

30 days of Data Engineering: Day 1

Day 1 — What is Data Engineering; Why Data Engineering; Data Engineers vs. ML Engineers & Data Scientists; Purpose…


30 days of Data Engineering: Day 2

Day 2 — Basic Python for Data Engineering


30 days of Data Engineering: Day 3

Day 3 — Advanced Python for Data Engineering


This is Day 4 of the Data Engineering series, where we will cover techniques for writing efficient and optimized code for Data Engineering.

4. Techniques to write efficient and optimized code for Data Engineering

We will cover the Python topics below in detail, with hands-on coding exercises:

Enumerate
Zip
Built-in Functions and Libraries
NumPy arrays
Multiple Assignments
Comprehensions
Membership operators
Counter class
Itertools
Sets to remove duplicates
Generators
Practice writing idiomatic code
Examine your code snippet’s runtime

Open up a Colab or Jupyter notebook and start coding.
Let’s dive in!

Enumerate in Python

enumerate() is a built-in function that lets you iterate over a sequence (such as a list, tuple, or string) while keeping track of the index of each item, without maintaining a counter by hand. It returns an iterator that yields pairs of the form (index, item), where index is the position of the current item in the sequence, and item is the item itself.

Syntax:
for index, item in enumerate(sequence):
    # do something with index and item

Here, sequence is the sequence you want to iterate over (such as a list or string). The enumerate() function takes the sequence as an argument and returns an iterator that generates (index, item) pairs. In the for loop, you can use the index and item variables to perform some operation on each item in the sequence.

Implementation —

fruits = ['apple', 'banana', 'orange']
for index, fruit in enumerate(fruits):
    print(index, fruit)

Output —

0 apple
1 banana
2 orange

You can also specify a starting value for the index by passing a second argument to enumerate().

Syntax:
enumerate(iterable, start=0)

Implementation —

fruits = ['apple', 'banana', 'orange']
for index, fruit in enumerate(fruits, start=1):
    print(index, fruit)

Output —

1 apple
2 banana
3 orange

enumerate() is a useful Python function that allows you to iterate over a sequence and keep track of the index of each item. It's a handy tool for tasks that require you to manipulate the index of items in a sequence, such as printing out numbered lists or creating dictionary keys from list items.
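
The two use cases mentioned above can be sketched as follows. This is a minimal example, and the column names are hypothetical values chosen only for illustration.

```python
# Hypothetical column names, used only for illustration.
columns = ["order_id", "customer", "amount"]

# Build a dictionary that maps each column name to its position.
column_index = {name: position for position, name in enumerate(columns)}
print(column_index)  # {'order_id': 0, 'customer': 1, 'amount': 2}

# Print a numbered list that starts at 1 instead of 0.
numbered = [f"{number}. {name}" for number, name in enumerate(columns, start=1)]
for line in numbered:
    print(line)
```

A lookup such as column_index["amount"] then gives the position of a column without searching the list each time.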

Zip in Python

In Python, the “zip” function is a built-in function that combines multiple iterables into a single iterator of tuples, where the i-th tuple contains the i-th element from each of the input iterables. If the inputs have different lengths, “zip” stops as soon as the shortest one is exhausted.

Syntax:
zip(*iterables)
zip(iterable1, iterable2, iterable3, …)

Here, “iterable1”, “iterable2”, “iterable3”, etc. are the input iterables that you want to combine. You can pass any number of iterables to the “zip” function.

Implementation —

# zip takes one or more iterables, aggregates their items into tuples, and returns an iterator
name = ["Steve","Paul","Brad"]
roll_no = [4,1,3]
marks = [20,40,50]

mapped = zip(name,roll_no,marks)
mapped = set(mapped)  # a set is unordered, so the printed order may vary
print(mapped)

Output —

{('Brad', 3, 50), ('Steve', 4, 20), ('Paul', 1, 40)}

Implementation —

list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
list3 = [True, False, True]

zipped_lists = zip(list1, list2, list3)

for item in zipped_lists:
    print(item)

Output —

(1, 'a', True)
(2, 'b', False)
(3, 'c', True)

You can also use the “zip” function to create a list of tuples by wrapping it in the “list” function, like this:

zipped_lists = list(zip(list1, list2, list3))

This will create a list of tuples containing the elements from the input lists.

Overall, the “zip” function is a useful tool for combining multiple iterables in Python, and it can be helpful for many different applications, including data processing, data analysis, and more.
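
As a data-processing illustration, here is a small sketch that pairs a header with a row to build a record, shows what happens with inputs of different lengths, and "unzips" the result again. The header and row values are made up for this example.

```python
# Hypothetical header and row, e.g. as read from a CSV file.
header = ["id", "name", "city"]
row = [101, "Asha", "Pune"]

# Pair each column name with its value to build a record.
record = dict(zip(header, row))
print(record)  # {'id': 101, 'name': 'Asha', 'city': 'Pune'}

# zip stops at the shortest input, so the extra value 3 is dropped.
pairs = list(zip([1, 2, 3], ["a", "b"]))
print(pairs)  # [(1, 'a'), (2, 'b')]

# "Unzip" a list of tuples back into separate tuples with the * operator.
numbers, letters = zip(*pairs)
print(numbers, letters)  # (1, 2) ('a', 'b')
```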

Built-in Functions and Libraries

To make your code run faster, prefer built-in functions and libraries where they fit. For example, map() applies a function to every item of an iterable and returns an iterator over the results.

Implementation —

"""
map() applies the given function to each item of an iterable
(such as a list or tuple) and returns a map object, which is an iterator.
"""
numbers = (100, 200, 300)
result = map(lambda x: x + x, numbers)
total = list(result)
print(total)

Output —

[200, 400, 600]
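
map() also accepts an existing built-in function directly, so a lambda is not always needed. The sketch below assumes some raw string values (hypothetical data) that need converting to integers, and then uses the built-in sum(), min(), and max() instead of hand-written loops.

```python
# Hypothetical raw values, e.g. as read from a text file.
raw_values = [" 10", "20 ", " 30 "]

# int() ignores surrounding whitespace, so it can be passed to map() as is.
cleaned = list(map(int, raw_values))
print(cleaned)  # [10, 20, 30]

# Built-in aggregates avoid writing an explicit Python loop.
total, smallest, largest = sum(cleaned), min(cleaned), max(cleaned)
print(total, smallest, largest)  # 60 10 30
```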

NumPy arrays

NumPy is a popular library in Python used for numerical computation, scientific computing, and data analysis. NumPy provides a powerful data structure called “NumPy arrays” that allows users to work with large, multidimensional arrays and matrices efficiently.

A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers. The dimensions of a NumPy array are known as axes. For example, a 1-dimensional NumPy array is often referred to as a “vector,” while a 2-dimensional NumPy array is often called a “matrix.” The number of axes is also known as the rank of the array.

NumPy arrays are created using the numpy.array() function, which takes a list, tuple, or any sequence-like object as input and returns a NumPy array.

Implementation —

import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(a)

Output —

[1 2 3 4 5]

NumPy arrays can be manipulated in many ways, including slicing and indexing, mathematical operations, and aggregation functions such as mean, median, and standard deviation. NumPy also provides a variety of functions for creating and manipulating arrays, such as numpy.zeros() for creating an array filled with zeros, numpy.ones() for creating an array filled with ones, and numpy.arange() for creating an array with a range of values.

Overall, NumPy arrays are a powerful data structure in Python that can be used for a wide range of numerical and scientific computations. They are efficient, flexible, and provide many useful functions for working with large amounts of data.
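
To make the functions mentioned above concrete, here is a short sketch. It assumes NumPy is installed in your notebook environment, and the variable names are only illustrative.

```python
import numpy as np

zeros = np.zeros(3)          # three 0.0 values
ones = np.ones((2, 2))       # 2x2 array of 1.0 values
evens = np.arange(0, 10, 2)  # 0, 2, 4, 6, 8

# Vectorized arithmetic applies to every element without a Python loop.
doubled = evens * 2
print(doubled.tolist())  # [0, 4, 8, 12, 16]

# Aggregation and slicing work directly on the array.
print(evens.mean())         # 4.0
print(evens[1:3].tolist())  # [2, 4]
print(ones.shape)           # (2, 2)
```

The vectorized expression evens * 2 is the main reason NumPy tends to beat a plain Python loop on large numeric data.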

Multiple Assignments

Multiple assignments in Python allow you to assign values to multiple variables in a single line of code. It can be very useful for initializing variables, assigning multiple return values from a function, and also for simplifying code that involves swapping values between variables.

Syntax:
variable1, variable2, variable3 = value1, value2, value3
variable1, variable2 = variable2, variable1

In this syntax, you assign values to multiple variables, separated by commas. The number of variables and values must match; otherwise, Python raises a ValueError.

Implementation —

# Use multiple assignment
a, b, c = 10, 20, 30
print(a,b,c)

# To swap variable
a,b = b,a
print(a,b)

Output —

10 20 30
20 10
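
Multiple assignment is also how you unpack several return values from a function. The sketch below uses a hypothetical helper named min_max() to show this, along with starred unpacking and the ValueError raised on a mismatch.

```python
def min_max(values):
    """Return the smallest and largest value as a tuple (illustrative helper)."""
    return min(values), max(values)

# Unpack the two return values into two names.
lowest, highest = min_max([7, 2, 9, 4])
print(lowest, highest)  # 2 9

# A starred name collects the remaining items into a list.
first, *rest = [1, 2, 3, 4]
print(first, rest)  # 1 [2, 3, 4]

# A mismatch between the number of names and values raises ValueError.
try:
    x, y = [1, 2, 3]
except ValueError:
    mismatch_detected = True
else:
    mismatch_detected = False
print(mismatch_detected)  # True
```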

Comprehensions in Python

In Python, a comprehension is a concise way of creating a new collection (such as a list, set, or dictionary) by processing an existing iterable. It lets you combine a loop and an optional condition in a single expression, which is often more readable and usually a little faster than the equivalent explicit loop. There are three types of comprehension covered here:

A.) List comprehension: This creates a new list by applying an expression to each item in an existing sequence.

Syntax:
new_list = [expression for item in existing_sequence if condition]

Implementation —

numbers = [1, 2, 3, 4, 5, 6]
squares = [n**2 for n in numbers if n % 2 == 0]
print(squares)

Output —

[4, 16, 36]

B.) Set comprehension: This creates a new set by applying an expression to each item in an existing sequence.

Syntax:
new_set = {expression for item in existing_sequence if condition}

Implementation —

numbers = [1, 2, 3, 4, 5, 6]
squares = {n**2 for n in numbers if n % 2 == 0}
print(squares)

Output —

{16, 4, 36}

C.) Dictionary comprehension: This creates a new dictionary by applying an expression to each item in an existing sequence.

Syntax:
new_dict = {key_expression: value_expression for item in existing_sequence if condition}

Implementation —

numbers = [1, 2, 3, 4, 5, 6]
squares = {n: n**2 for n in numbers if n % 2 == 0}
print(squares)

Output —

{2: 4, 4: 16, 6: 36}

Comprehensions are a powerful feature of Python that allows you to write concise and efficient code for creating new sequences based on existing ones.
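
In data engineering work, the same three forms are handy for reshaping records. Here is a minimal sketch, assuming a small list of made-up order records.

```python
# Hypothetical records, used only for illustration.
orders = [
    {"id": 1, "amount": 250, "status": "paid"},
    {"id": 2, "amount": 90, "status": "cancelled"},
    {"id": 3, "amount": 120, "status": "paid"},
]

# List comprehension: amounts of paid orders only.
paid_amounts = [order["amount"] for order in orders if order["status"] == "paid"]

# Set comprehension: the distinct statuses that appear.
statuses = {order["status"] for order in orders}

# Dictionary comprehension: a lookup table from id to amount.
amount_by_id = {order["id"]: order["amount"] for order in orders}

print(paid_amounts)   # [250, 120]
print(amount_by_id)   # {1: 250, 2: 90, 3: 120}
```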

Membership operators in Python

In Python, membership operators are used to test whether a value or variable is a member of a sequence or set. There are two membership operators: in and not in.

The in operator returns True if a value or variable is found in the sequence or set, and False otherwise. The not in operator returns the opposite result, i.e., True if the value or variable is not found in the sequence or set, and False otherwise.

Syntax:
# in operator
value in sequence
# not in operator
value not in sequence

Implementation —

# Define a list of colors
colors = ['red', 'green', 'blue', 'yellow']

# Test if 'red' is a member of the list
if 'red' in colors:
    print("Red is in the list of colors.")

# Test if 'orange' is not a member of the list
if 'orange' not in colors:
    print("Orange is not in the list of colors.")

Output —

Red is in the list of colors.
Orange is not in the list of colors.
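
The operators work the same way on lists and sets, but the cost differs: a list is scanned item by item, while a set uses a hash lookup. For repeated checks against a large collection, converting it to a set once is usually worthwhile. A small sketch with hypothetical IDs:

```python
# Hypothetical collection of allowed IDs.
allowed_ids = list(range(100_000))
allowed_id_set = set(allowed_ids)  # one-time conversion

# Same answers, but the set lookup does not scan the whole collection.
in_list = 99_999 in allowed_ids
in_set = 99_999 in allowed_id_set
missing = -1 not in allowed_id_set

print(in_list, in_set, missing)  # True True True
```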

Counter class in Python

In Python, a counter is a subclass of the dictionary object that is used to count the occurrences of elements in a sequence. The Counter class is available in the collections module of Python’s standard library.

Syntax:
from collections import Counter
my_counter = Counter(iterable)

Here, the iterable parameter is a sequence, such as a list or a tuple, containing the elements that need to be counted. Once the counter is created, it can be used to access the count of each element in the sequence.

Implementation —

from collections import Counter

my_list = [1, 2, 3, 1, 2, 1, 4, 5, 4, 4, 4]
my_counter = Counter(my_list)

print(my_counter)
print(my_counter[1])
print(my_counter[4])
print(my_counter[6])

Output —

Counter({4: 4, 1: 3, 2: 2, 3: 1, 5: 1})
3
4
0

In this example, we first import the Counter class from the collections module. We then create a list called my_list that contains some elements with repeated occurrences. We then create a Counter object called my_counter using my_list. The output of print(my_counter) shows the count of each element in the list.

We can then access the count of a specific element using the square bracket notation. For example, my_counter[1] returns the count of the element 1, which is 3. If an element is not present in the counter, its count will be zero. For example, my_counter[6] returns 0 because the element 6 is not present in the counter.
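
Two other Counter methods that come up often are most_common() and update(). The sketch below assumes a list of made-up log levels.

```python
from collections import Counter

# Hypothetical log levels, used only for illustration.
levels = ["error", "info", "error", "warning", "info", "error"]
counts = Counter(levels)

# most_common(n) returns the n most frequent elements with their counts.
top_two = counts.most_common(2)
print(top_two)  # [('error', 3), ('info', 2)]

# update() adds counts from another iterable to the existing ones.
counts.update(["warning", "warning"])
print(counts["warning"])  # 3
```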

Itertools in Python

The itertools module is a standard Python library that provides a collection of functions for working with iterators. These functions are designed to work with iterable objects, such as lists, tuples, and generators, and can be used to perform a wide range of operations on them, including filtering, grouping, and combining. The itertools module provides many functions, including count(), cycle(), repeat(), groupby(), and chain(). These functions can be used to create complex iterators and perform various operations on them.

Here’s a brief explanation of some commonly used functions in the itertools module:

A.) count(start, step): This function generates an iterator that produces an infinite sequence of numbers, starting from start and incrementing by step at each iteration.

Syntax:
itertools.count(start=0, step=1)

Implementation —

import itertools

counter = itertools.count(5)

print(next(counter))
print(next(counter))
print(next(counter))

Output —

5
6
7

B.) cycle(iterable): This function generates an iterator that repeats the elements of the given iterable indefinitely.

Syntax:
itertools.cycle(iterable)

Implementation —

import itertools

colors = itertools.cycle(['red', 'green', 'blue'])

print(next(colors))
print(next(colors))
print(next(colors))
print(next(colors))

Output —

red
green
blue
red

C.) repeat(item, [times]): This function generates an iterator that produces the given item indefinitely or times number of times.

Syntax:
itertools.repeat(item, [times])

Implementation —

import itertools

repeated_items = itertools.repeat('Hello', 3)

print(next(repeated_items))
print(next(repeated_items))
print(next(repeated_items))

Output —

Hello
Hello
Hello

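D.) groupby(iterable, [key]): This function generates an iterator that groups consecutive elements of the given iterable that share the same result from the key function. A new group starts every time the key value changes, so the same key can appear more than once if the input is not sorted by that key.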

Syntax:
itertools.groupby(iterable, [key])

Implementation —

import itertools

def group_key(x):
    return x.isupper()

grouped_items = itertools.groupby('aBcdEFgHi', group_key)

for key, group in grouped_items:
    print(key, list(group))

Output —

False ['a']
True ['B']
False ['c', 'd']
True ['E', 'F']
False ['g']
True ['H']
False ['i']

E.) chain(*iterables): This function generates an iterator that combines the elements of the given iterables into a single sequence.

Syntax:
itertools.chain(*iterables)

Implementation —

import itertools

numbers = [1, 2, 3]
letters = ['a', 'b', 'c']

combined = itertools.chain(numbers, letters)

for item in combined:
    print(item)

Output —

1
2
3
a
b
c

These are just a few examples of the many functions provided by the itertools module in Python.
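
Because groupby() only groups consecutive items with the same key, a common pattern is to sort by that key first. The sketch below shows this on made-up records, and also uses itertools.islice() (another function from the same module) to take a finite slice of the infinite count() iterator.

```python
import itertools

# Hypothetical (category, name) records.
records = [("fruit", "apple"), ("veg", "carrot"), ("fruit", "banana"), ("veg", "pea")]

def category(record):
    """Key function: group records by their first field."""
    return record[0]

# Sort by the key first so that each category forms one consecutive run.
sorted_records = sorted(records, key=category)
grouped = {
    key: [name for _, name in group]
    for key, group in itertools.groupby(sorted_records, key=category)
}
print(grouped)  # {'fruit': ['apple', 'banana'], 'veg': ['carrot', 'pea']}

# islice() takes the first five values from an infinite iterator.
first_evens = list(itertools.islice(itertools.count(0, 2), 5))
print(first_evens)  # [0, 2, 4, 6, 8]
```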

Sets to remove duplicates

In Python, a set is a collection of unique elements. This means that any duplicates in the collection will automatically be removed when it is converted to a set.

Let’s say you have a list of numbers with duplicates. To remove the duplicates from this list, you can convert it to a set using the set() function, and then convert it back to a list.

Implementation —

numbers = [1, 2, 3, 2, 4, 3, 5, 6, 5]
unique_numbers = list(set(numbers))
print(unique_numbers)

Output —

[1, 2, 3, 4, 5, 6]

In the above example, we first create a list of numbers with duplicates. We then convert this list to a set using the set() function, which removes the duplicates automatically since sets only contain unique elements. Finally, we convert the set back to a list using the list() function and print the result.

This approach can be used to remove duplicates from any iterable of hashable elements, including lists, tuples, and the values of a dictionary. Keep in mind that a set is unordered, so the original order of the elements is not guaranteed to be preserved.
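
When the original order matters, one option is dict.fromkeys(), since dictionaries keep insertion order (Python 3.7+). A minimal sketch:

```python
numbers = [3, 1, 3, 2, 1, 5]

# dict keys are unique and keep insertion order, so the first
# occurrence of each value survives in its original position.
unique_in_order = list(dict.fromkeys(numbers))
print(unique_in_order)  # [3, 1, 2, 5]
```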

Generators in Python

In Python, generators are functions that create iterators, which produce a sequence of values on the fly without having to store all of them in memory at once. Below are two ways to produce values lazily in Python:

A.) Generators using the “yield” keyword:
To create a generator using the “yield” keyword, you simply define a function that contains one or more “yield” statements. Calling the function returns a generator object without running its body. Each time a value is requested (for example by next() or a for loop), the function runs until it reaches a “yield” statement and hands back the value that follows the “yield” keyword. The function is then paused and resumes from that point the next time a value is requested.

Syntax
def generator_function():
    # some code
    yield value
    # more code

Implementation —

def fibonacci_sequence(n):
    a, b = 0, 1
    for i in range(n):
        yield a
        a, b = b, a + b

# Using the generator function to print Fibonacci sequence
for num in fibonacci_sequence(10):
    print(num)

Output —

0
1
1
2
3
5
8
13
21
34

In the above example, the generator function fibonacci_sequence() generates the Fibonacci sequence using the "yield" keyword. It yields one Fibonacci number at a time, each time the for loop asks for the next value. The loop at the bottom uses the generator to print the first 10 numbers of the sequence.

B.) Lazy sequences using the “range()” function:
Strictly speaking, “range()” returns a lazy sequence object rather than a generator, but it behaves in a similar way inside a loop: it produces each integer on the fly, without having to store all of them in memory at once.

Syntax
for i in range(start, stop, step):
    # some code

Implementation —

for i in range(0, 10, 2):
    print(i)

Output —

0
2
4
6
8

In the above example, the “range()” function is used to produce the even numbers from 0 up to (but not including) 10 by using a step of 2. The loop then prints each number in the sequence one at a time as it is produced.
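
Another compact way to get a generator is a generator expression, which looks like a list comprehension written with parentheses. The sketch below compares its memory footprint with the equivalent list; exact sizes vary by Python version, so only the comparison is shown.

```python
import sys

# The list stores all 100,000 results; the generator stores only its state.
squares_list = [n * n for n in range(100_000)]
squares_gen = (n * n for n in range(100_000))

gen_is_smaller = sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)
print(gen_is_smaller)  # True

# Values are produced only when requested.
first_square = next(squares_gen)
print(first_square)  # 0

# A generator expression can feed an aggregate directly.
total = sum(n * n for n in range(10))
print(total)  # 285
```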

Practice writing idiomatic code

Practice writing idiomatic Python, because it often makes your code both cleaner and faster. Idiomatic code relies on language constructs and libraries that are well optimized, and it tends to be more readable and maintainable. That said, always measure before assuming that one version is faster.

Suppose you need to create a list of the squares of the first 10 non-negative integers. You could do this in Python using a for loop and the append() method.

Implementation —

squares = []
for i in range(10):
    squares.append(i ** 2)

However, a more idiomatic way to accomplish this same task would be to use list comprehension.

Implementation —

squares = [i ** 2 for i in range(10)]

This code is more concise and easier to read, and it uses a construct that is well optimized in Python. When benchmarked, the list comprehension version typically runs somewhat faster than the for loop version, although the exact difference depends on your Python version and machine.

So, by practicing writing idiomatic code in Python, you can make your code faster and more efficient while also making it more readable and maintainable.

Examine your code snippet’s runtime

In Python, examining the runtime of your code snippet involves measuring how long it takes for a piece of code to execute. This can be useful for optimizing the performance of your code and identifying potential bottlenecks.

Implementation —

# Examine the runtime of your code snippet
import time

start_time = time.time()

# your code snippet here

end_time = time.time()

elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time)

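In this code, we import the time module, which allows us to measure the time taken by our code to execute. We then use the time.time() function to get the current time at the start and end of our code snippet. We subtract the start time from the end time to get the elapsed time, which is the amount of time taken by our code to execute. For very short snippets, time.perf_counter() or the timeit module gives more reliable measurements, because time.time() has limited resolution and can be affected by system clock adjustments.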

To use this code, replace the # your code snippet here comment with the code you want to measure the runtime of. When you run the code, it will output the elapsed time in seconds.

Implementation —

# Examine the runtime of a simple loop
import time

start_time = time.time()

for i in range(1000000):
    pass

end_time = time.time()

elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time)

In this code, we use a for loop to perform a simple operation 1 million times. We then measure the runtime of the loop using the time module and print out the elapsed time. When you run this code, you should see the elapsed time in seconds, which will give you an idea of how long it takes for the loop to execute.
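
For comparing two small snippets, the standard library's timeit module is often a better fit than manual timestamps, because it runs the code many times. Here is a rough sketch; the function names are illustrative, and the timings you see will differ from machine to machine.

```python
import timeit

def squares_with_loop():
    """Build the list with a for loop and append()."""
    squares = []
    for i in range(1000):
        squares.append(i ** 2)
    return squares

def squares_with_comprehension():
    """Build the same list with a list comprehension."""
    return [i ** 2 for i in range(1000)]

# Run each function 200 times per measurement, repeat 3 times, keep the best.
loop_time = min(timeit.repeat(squares_with_loop, number=200, repeat=3))
comp_time = min(timeit.repeat(squares_with_comprehension, number=200, repeat=3))

print(f"loop: {loop_time:.4f}s, comprehension: {comp_time:.4f}s")
```

Taking the minimum of several repeats reduces the influence of other processes running on the machine.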

Some of the most important optimization techniques are:

Use built-in functions and libraries: Python has a lot of built-in functions and libraries that are optimized for performance. Using them can save a lot of time and memory.
Avoid using global variables: Global variables can slow down the performance of your code and make it harder to debug.
Use list comprehensions: List comprehensions are a more efficient way to create and manipulate lists in Python.
Use generators: Generators are a way to create iterators in Python. They are more memory-efficient than lists because they only generate values on-the-fly as they are needed.
Use the “join” method instead of “+” for strings: The “+” operator creates a new string each time it is used, which can slow down your code. The “join” method is faster and more memory-efficient.
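Use the “in” operator instead of the “index” method for lists: The “in” operator is the clearer and more direct way to check whether an element is in a list, and unlike “index” it does not raise a ValueError when the element is missing.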
Avoid using unnecessary loops: Unnecessary loops can slow down your code and use up more memory.
Use the “multiprocessing” module for parallel processing: The “multiprocessing” module allows you to run multiple processes in parallel, which can speed up your code.
Use “numpy” for numerical computations: The numpy library is highly optimized for numerical computations and can be significantly faster than pure Python code.
Profile and Optimize: Use profilers like cProfile, line_profiler, memory_profiler, etc. to profile and optimize your code.
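
As one illustration from the list above, here is a small sketch of the “join” tip. Both versions produce the same string; the join version simply avoids building many intermediate strings.

```python
parts = [str(n) for n in range(5)]

# Repeated "+" builds a new string object on each iteration.
joined_with_plus = ""
for part in parts:
    joined_with_plus += part + ","
joined_with_plus = joined_with_plus.rstrip(",")

# join() assembles the final string in a single call.
joined = ",".join(parts)

print(joined)                      # 0,1,2,3,4
print(joined == joined_with_plus)  # True
```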

That’s it for today!

Follow for more updates & Stay tuned!
Keep learning and coding
