Leveraging the Power of Python Generators for Real-Time Data Processing

Explore how to use Python generators to enhance efficiency and scalability in your data-heavy applications, with real-world examples and tips.

When coding in Python, it’s common to use lists and tuples for data storage and iteration.

While these data structures work well for most scripts and applications, sometimes these iterables become too large and inefficient to work with.

For example, if we were to sum 1 million numbers, we may write our code like this:

numbers = [i for i in range(1000000)]
sum_numbers = sum(numbers)

However, a drawback of this is that we need to store every number in memory. For small datasets, this isn’t an issue, but what if you had 1 billion numbers? What about 1 trillion? What about 1 centillion (10^303) numbers?

To handle this, we can use a Python generator, like so:

def generate_numbers():
    for i in range(1000000):
        yield i

sum_numbers = sum(generate_numbers())
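
To get a rough feel for the memory difference, one option is to compare the size of a list object with the size of a generator object using sys.getsizeof. This is only a sketch: getsizeof reports the container's own size, not the integers it references, and the exact byte counts depend on the Python version and platform. The variable names here are illustrative and not part of the original example.

```python
import sys

N = 1_000_000

# A list holds a reference to every element up front.
numbers_list = [i for i in range(N)]

# A generator expression holds only its current state.
numbers_gen = (i for i in range(N))

# The list's own size grows with N; the generator's size does not.
print(sys.getsizeof(numbers_list))
print(sys.getsizeof(numbers_gen))

# Both produce the same sum.
list_total = sum(numbers_list)
gen_total = sum(numbers_gen)
print(list_total == gen_total)
```

Note that the generator expression in parentheses is a shorthand for a small generator function like generate_numbers above.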

In this article, I want to discuss how Python generators work and provide a practical example using Pandas.

What is a Python generator?

Python generators are a cornerstone for writing efficient code, especially when working with data streams that are too large to fit into memory.

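Generators allow for lazy evaluation: they yield items only as they’re needed. This lets developers iterate over data on the fly without storing it all at once, which can dramatically reduce memory overhead. It can also save processing time when you stop early, since items you never request are never computed.

Generator functions use the yield keyword instead of return. Each yield hands a value to the caller and pauses the function with its local state intact, which allows the code to “pick up” where it left off on the next request.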
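
To make the pause-and-resume behavior concrete, here is a minimal sketch that steps through a generator by hand with the built-in next(). The function count_up_to is a made-up example, not something from a library.

```python
def count_up_to(limit):
    """Yield 1, 2, ..., limit, pausing after each value."""
    current = 1
    while current <= limit:
        yield current   # pause here and hand `current` to the caller
        current += 1    # execution resumes here on the next request

counter = count_up_to(3)

first = next(counter)   # runs the body until the first yield
second = next(counter)  # resumes after the yield and loops around
third = next(counter)

print(first, second, third)  # prints: 1 2 3

# Once the function body finishes, the generator is exhausted.
# Passing a default to next() avoids a StopIteration exception.
exhausted = next(counter, None)
print(exhausted)  # prints: None
```

A for loop does the same thing behind the scenes: it calls next() repeatedly and stops when the generator is exhausted.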

When to use Python generators

In an ideal scenario, you’re going to want to use Python generators when you’re:

  1. Processing large datasets: If the dataset is too large to fit comfortably in memory, you can process it piece by piece instead of loading it all at once.

  2. Implementing lazy evaluation: Sometimes, we only need to load data on an as-needed basis.

  3. Managing streams of data: This includes source files, logs, databases, and more.
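
As a small illustration of lazy evaluation, the sketch below defines a generator that never ends and takes only the first few values from it with itertools.islice. The squares function is a hypothetical stand-in for any source that is too large (or unbounded) to build as a list.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever."""
    n = 0
    while True:
        yield n * n
        n += 1

# Take only the first five values; nothing beyond them is computed.
first_five = list(islice(squares(), 5))
print(first_five)  # prints: [0, 1, 4, 9, 16]
```

Building the same sequence as a list would never finish, while the generator does only as much work as the caller asks for.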

Examples of using Python generators

Data manipulation using Pandas

Generators can be very powerful when paired with Pandas to process chunks of data:

import pandas as pd

def chunk_data(file_name):
    for chunk in pd.read_csv(file_name, chunksize=1024):
        # Process each chunk. Here, filter by sales
        processed_chunk = chunk[chunk['sales'] > 1000]
        yield processed_chunk

# Consuming the generator
for data_chunk in chunk_data('sales_data.csv'):
    print(data_chunk)

Here, we read a sales data CSV file in chunks of 1,024 rows, keep only the rows where sales are greater than 1000, and yield each filtered chunk. Only one chunk needs to be in memory at a time.
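
Printing each chunk is rarely the end goal. One possible next step, sketched below, is to keep a running total across chunks so the full file is never held in memory. To stay runnable without a real file, this sketch reads a tiny made-up CSV from an in-memory string; the filtered_chunks helper, the sample rows, and the chunk size of 2 are assumptions for illustration only.

```python
import io

import pandas as pd

# A tiny, made-up CSV standing in for sales_data.csv.
csv_text = "region,sales\nnorth,500\nsouth,1500\neast,2500\nwest,800\nnorth,1200\n"

def filtered_chunks(source, chunksize):
    """Yield each chunk of the CSV with only rows where sales > 1000."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield chunk[chunk['sales'] > 1000]

# Add up the filtered sales one chunk at a time.
total_sales = 0
for data_chunk in filtered_chunks(io.StringIO(csv_text), chunksize=2):
    total_sales += int(data_chunk['sales'].sum())

print(total_sales)  # prints: 5200
```

With a real file, you would pass the file name instead of the StringIO object and pick a much larger chunk size.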

Streaming log files

When we want to process a log file line by line and extract certain information, a generator lets us stream the file instead of reading the entire log into memory at once:

def stream_file(file_path):
    """Stream a file line by line"""
    with open(file_path, 'r') as file:
        for line in file:
            # Check to see if it's a line we want
            if "ERROR" in line:
                yield line

# Define the log file
log_path = 'logfile.log'

# Create the generator (no lines are read from the file yet)
error_lines = stream_file(log_path)

# Print the error lines.
for error_line in error_lines:
    print(error_line.strip())
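
Generators can also be chained, with each stage consuming the previous one, so a line flows through the whole pipeline before the next line is read. The sketch below splits the work into three small stages. The stage names, the sample log content, and the "LEVEL: message" layout are all assumptions made for this example, and a temporary file is used only so the code runs on its own.

```python
import os
import tempfile

def read_lines(file_path):
    """Yield each line of the file without its trailing newline."""
    with open(file_path, 'r') as file:
        for line in file:
            yield line.rstrip('\n')

def only_errors(lines):
    """Yield only the lines that contain "ERROR"."""
    for line in lines:
        if "ERROR" in line:
            yield line

def messages(lines):
    """Yield the text after the first ': ' (assumed log layout)."""
    for line in lines:
        yield line.split(': ', 1)[-1]

# Write a small, made-up log so the sketch is self-contained.
sample = "INFO: started\nERROR: disk full\nINFO: retrying\nERROR: timeout\n"
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Chain the stages; no line is read until the pipeline is consumed.
pipeline = messages(only_errors(read_lines(sample_path)))
error_messages = list(pipeline)
print(error_messages)  # prints: ['disk full', 'timeout']

# Clean up the temporary file.
os.remove(sample_path)
```

Keeping each stage small makes it easy to swap a filter or add a new step without touching the rest of the pipeline.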
