How to Write Readable Code as a Data Scientist

Imagine stepping into a data science project halfway through its lifecycle. You’re tasked with understanding the codebase, making changes, and adding new features. But you're greeted with a labyrinth of code, a tangled web of variables, functions, and logic that seems to defy comprehension. As you attempt to decipher its mysteries, frustration sets in, and clarity becomes a distant dream.

This scenario is all too common in the world of data science, just as it is in software engineering. Usually, data scientists are more focused on the data and the models they build, rather than the code that powers them. But at the end of the day, code is the glue that holds everything together. It’s the bridge between your ideas and the end product.

In this article, we’ll explore the art of writing readable code and provide some practical tips.

Why Readable Code Matters

Let's imagine you are fresh out of college and have just landed your first job as a data scientist. Your first task is to understand a data analysis pipeline designed to predict customer churn. You are excited to dive in and make your mark. But as you start reading the code, you realize it's a mess. Variable names like x1, y2, and tempVar are scattered throughout the code. Functions are hundreds of lines long, with no comments or documentation to explain what they do. Function names are unclear and don't give you any clue about their purpose. What should have been a straightforward task turns into a nightmare of trial and error. You spend hours trying to decipher what each function does, and yet the code remains a mystery.

Wouldn't it be great if the code were more readable, so that you could understand it at a glance? Imagine descriptive variable names, concise and well-documented functions, and logic that is easy to follow. Readable code is like a well-written novel. It tells a story, guides you through its narrative, and keeps you engaged from start to finish. It's a pleasure to read, easy to understand, and a joy to work with. Wouldn't it be great if you could write code that others love to read? Let's explore how you can do just that.

What Not to Do: Common Pitfalls in Writing Code

Bad Variable Names

One of the most common mistakes in writing code is using bad variable names.
For example, consider the following code snippet:

a = 10
b = 20
c = a + b
print(c)

What do a, b, and c represent? Without any context, it's hard to tell.
Now, let's rewrite the code with more descriptive variable names:

num_apples = 10
num_oranges = 20
total_fruits = num_apples + num_oranges
print(total_fruits)

If you write code like this, it's immediately clear what the variables represent. It looks more like a story than a puzzle, making it easier to understand and work with. Always strive to use meaningful names that convey the purpose of the variable. Avoid generic names like temp, data, or result when a more specific name is available. Instead, use names that describe the data or its role in the program.

This small change can make a big difference in the readability of your code.
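The same idea carries over to data science code. As a minimal sketch, using made-up customer records and hypothetical field names (months_active, has_churned), compare how a churn-rate calculation reads with cryptic names and with descriptive ones:

```python
# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"months_active": 3, "has_churned": True},
    {"months_active": 24, "has_churned": False},
    {"months_active": 12, "has_churned": False},
    {"months_active": 1, "has_churned": True},
]

# Hard to read: what are d, n, and r?
d = [c for c in customers if c["has_churned"]]
n = len(customers)
r = len(d) / n

# Easier to read: each name states what the value holds.
churned_customers = [customer for customer in customers if customer["has_churned"]]
total_customers = len(customers)
churn_rate = len(churned_customers) / total_customers

print(f"Churn rate: {churn_rate:.0%}")  # prints "Churn rate: 50%"
```

Both versions compute the same number, but only the second one can be understood without tracing every line.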

Long Functions

Another common pitfall in writing code is creating long functions that do too much. Long functions are hard to understand, debug, and maintain. They usually violate the single responsibility principle, which, applied to functions, says that a function should do one thing and do it well. When a function is hundreds of lines long, it becomes a tangled mess of logic that is difficult to follow.

Let's look at an example of a long function:

def process_data(data):
    # Clean the data
    cleaned_data = []
    for item in data:
        if item is not None:
            cleaned_data.append(item.strip())

    # Calculate summary statistics
    total = 0
    count = 0
    for item in cleaned_data:
        if item.isdigit():
            total += int(item)
            count += 1
    mean = total / count if count > 0 else 0

    # Normalize the data
    normalized_data = []
    for item in cleaned_data:
        if item.isdigit():
            normalized_data.append(int(item) / mean if mean != 0 else 0)

    return {
        'cleaned_data': cleaned_data,
        'mean': mean,
        'normalized_data': normalized_data
    }

This function is doing too much: cleaning data, calculating statistics, and normalizing data.
Let's break it into smaller, more focused functions.

# Function to clean the data
def clean_data(data):
    cleaned_data = []
    for item in data:
        if item is not None:
            cleaned_data.append(item.strip())
    return cleaned_data

# Function to calculate mean
def calculate_mean(data):
    total = sum(int(item) for item in data if item.isdigit())
    count = sum(1 for item in data if item.isdigit())
    return total / count if count > 0 else 0

# Function to normalize the data
def normalize_data(data, mean):
    normalized_data = []
    for item in data:
        if item.isdigit():
            normalized_data.append(int(item) / mean if mean != 0 else 0)
    return normalized_data

# Main function to process data
def process_data(data):
    cleaned_data = clean_data(data)  # Step 1: Clean the data
    mean = calculate_mean(cleaned_data)  # Step 2: Calculate the mean
    normalized_data = normalize_data(cleaned_data, mean)  # Step 3: Normalize the data

    return {
        'cleaned_data': cleaned_data,
        'mean': mean,
        'normalized_data': normalized_data
    }

By breaking the function into smaller, more focused functions, we improve readability and maintainability.
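One practical benefit of the split is that each piece can be checked on its own. The sketch below re-declares two of the helpers (written more compactly, but intended to behave the same way) and tests them on made-up input. The name raw_values and the sample data are assumptions for illustration only.

```python
def clean_data(data):
    """Drop None entries and strip surrounding whitespace."""
    return [item.strip() for item in data if item is not None]


def calculate_mean(data):
    """Average the items that are strings of digits; return 0 if there are none."""
    numbers = [int(item) for item in data if item.isdigit()]
    return sum(numbers) / len(numbers) if numbers else 0


# Made-up input for illustration
raw_values = [" 10 ", None, "20", "abc", "30 "]

cleaned_values = clean_data(raw_values)

# Each helper can be verified in isolation.
assert cleaned_values == ["10", "20", "abc", "30"]
assert calculate_mean(cleaned_values) == 20
assert calculate_mean([]) == 0
```

With the original single function, checking the cleaning step alone would have required running the statistics and normalization code as well.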

Key Principles:

Single Responsibility: Each function should focus on a single task, making it easier to understand and maintain.
Modularization: Splitting large functions into smaller ones promotes code reuse and simplifies debugging.
Meaningful Names: Use descriptive names for functions and variables to improve readability.
Function Length: A general rule of thumb is to keep functions under 20 lines, but this isn't strict.

Code Duplication

Code duplication is another common issue that can lead to maintenance headaches. When the same logic is repeated in multiple places, it becomes harder to update and maintain.

Let's look at an example of code duplication:

# Calculate the square of numbers in two different lists
list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Duplication in squaring
squares1 = []
for num in list1:
    squares1.append(num ** 2)

squares2 = []
for num in list2:
    squares2.append(num ** 2)

Here, the logic for squaring the numbers is duplicated: the same loop is written once for each list. Instead of repeating the same code, we can create a function to encapsulate the logic.

# Function to calculate squares
def calculate_squares(numbers):
    return [num ** 2 for num in numbers]

list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Use the function to eliminate duplication
squares1 = calculate_squares(list1)
squares2 = calculate_squares(list2)

Yes, it's a simple example, but it illustrates the point, and the same principle applies to more complex scenarios. Applying the DRY (Don't Repeat Yourself) principle can help reduce code duplication and make your code more maintainable.
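As a slightly more realistic sketch, suppose several feature columns need the same rescaling. The function name min_max_scale and the sample columns below are made up for illustration. The point is that the scaling rule lives in one place, so a change (for example, how constant columns are handled) only has to be made once.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1]; a constant column maps to all zeros."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]


# Made-up feature columns
ages = [20, 30, 40]
incomes = [1000, 3000, 5000]

# One function, reused for every column
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)     # [0.0, 0.5, 1.0]
print(scaled_incomes)  # [0.0, 0.5, 1.0]
```

Note that this sketch assumes a non-empty list; min() and max() raise a ValueError on empty input.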

Poorly Structured Code

Code structure plays a crucial role in readability. A well-structured codebase is like a well-organized library, where each section has a clear purpose and is easy to navigate.

Here is an example of code where everything is packed into a few dense lines:

data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)
std_dev = (sum([(x - mean) ** 2 for x in data]) / len(data)) ** 0.5
print(f"Mean: {mean}, Standard Deviation: {std_dev}")

This code calculates the mean and standard deviation of a list of numbers. However, it's hard to follow because the logic is not clearly separated. Let's refactor it to improve readability:

# Separation of concerns with functions
def calculate_mean(data):
    return sum(data) / len(data)

def calculate_standard_deviation(data, mean):
    return (sum([(x - mean) ** 2 for x in data]) / len(data)) ** 0.5

data = [1, 2, 3, 4, 5]
mean = calculate_mean(data)
std_dev = calculate_standard_deviation(data, mean)

print(f"Mean: {mean}, Standard Deviation: {std_dev}")

By separating the logic into functions, we improve readability and make the code easier to understand. Each function now has a clear purpose, and the main code is more concise and focused.
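For common statistics like these, Python's standard library already provides well-named functions, which can be even easier to read than hand-written formulas. A minimal sketch using the statistics module, assuming the population standard deviation (dividing by the number of items) is what is wanted, as in the formula above:

```python
import statistics

data = [1, 2, 3, 4, 5]

mean = statistics.mean(data)
# pstdev is the population standard deviation (divides by n).
# statistics.stdev would give the sample version (divides by n - 1).
std_dev = statistics.pstdev(data)

print(f"Mean: {mean}, Standard Deviation: {std_dev:.4f}")
```

Whether to write your own helper or use a library function is a judgment call, but a reader who knows the library can take in the intent at a glance.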

Lack of Comments and Documentation

Comments and documentation are essential for understanding code. They provide context, explain the rationale behind the code, and guide developers through its logic.

Let's look at an example of code whose comments add no useful information:

# Function to process data
def process_data(data):

    transformed_data = [x * 2 for x in data if x > 0]

    # Calculate something
    result = sum(transformed_data) / len(transformed_data)

    # Return the result
    return result

# List of values to process
values = [1, 2, 3, -1, -2, -3]
# Process data
final_result = process_data(values)
# Print the final result
print(final_result)

Without meaningful comments, it's hard to understand the purpose of the code and how it works. Comments such as "Calculate something" only restate that code exists; they don't say what it does or why. The process_data function is doing some transformation and calculation, but the logic is not clear. The variable names are also not very descriptive, making it harder to follow.

Let's add a docstring and meaningful comments while keeping the logic the same:

# Function to process data: doubles positive numbers and calculates the average
def process_data(data):
    """
    This function takes a list of numbers, doubles the positive ones,
    and returns the average of the transformed values.

    Parameters:
    data (list of int): The input list containing numbers to be processed.

    Returns:
    float: The average of the transformed positive numbers.
    """
    # Double each positive number
    transformed_data = [x * 2 for x in data if x > 0]

    # Calculate the average of transformed data
    # We assume there's at least one positive number in the input
    result = sum(transformed_data) / len(transformed_data)

    return result


# List of values to process
values = [1, 2, 3, -1, -2, -3]

# Process the data to get the average of doubled positive numbers
final_result = process_data(values)

# Display the final result to the user
print(f"Average of doubled positive numbers: {final_result:.2f}")

With comments, we now have a better understanding of what the code does and how it does it. The function's purpose is clear, and the comments provide additional context to guide the reader.
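Comments work best alongside descriptive names and explicit checks. As one possible variation, the sketch below renames the function (average_of_doubled_positives is a made-up name) and turns the "at least one positive number" assumption into an explicit error instead of leaving it as a comment:

```python
def average_of_doubled_positives(numbers):
    """Double the positive numbers and return their average.

    Raises:
        ValueError: If the input contains no positive numbers.
    """
    doubled_positives = [number * 2 for number in numbers if number > 0]
    if not doubled_positives:
        raise ValueError("Expected at least one positive number.")
    return sum(doubled_positives) / len(doubled_positives)


values = [1, 2, 3, -1, -2, -3]
average = average_of_doubled_positives(values)
print(f"Average of doubled positive numbers: {average:.2f}")  # prints 4.00
```

With the name carrying most of the meaning, the remaining comments can focus on why the code does something rather than what it does.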

Conclusion

Writing readable code is an essential skill for any data scientist. It not only makes your code easier to understand and maintain but also improves collaboration with your team. In this article, we've covered some common pitfalls in writing code and provided practical tips to avoid them. By following these guidelines, you can write code that tells a clear story, guides the reader through its logic, and makes your work a joy to read and work with. Be the author of code that others love to read! Go ahead and write code that tells a story.

If you liked this article and want to get more tips like these, consider subscribing to my newsletter, and get the latest updates directly in your inbox.

Thank you for reading! I hope you found this article helpful. If you have any questions or comments, feel free to reach out.