
The Python Project That Finally Made Machine Learning “Click” For Me

By abualyaanart · Published about an hour ago · 9 min read

Not because I learned some secret library — but because I stopped treating Python like a collection of tricks.

The turning point was a pull request review with a single comment from my tech lead:

“This works, but I have no idea how you think about problems.”

The PR added a new machine learning feature flag. The code ran. Tests passed. The model improved our metric by a tiny amount.

And yet that comment hit harder than any failing unit test.

Because it was true.

My code looked like a Frankenstein of Stack Overflow answers, copied notebooks, and half-understood pandas incantations. I could duct-tape my way to something that ran, but there was no structure. No trace of judgment. Just “Python tricks until it works.”

That review forced me to admit something I’d been avoiding for years:

I could “do machine learning in Python,” but I couldn’t use Python to build real ML systems.

The Old Way: Random Scripts, Random Wins

For a long time, my workflow was basically this:

Find a dataset or business request.

Spin up a Jupyter notebook.

Import pandas, NumPy, scikit-learn, matplotlib.

Try whatever I’d recently seen in a blog post.

Keep running cells until something “looked good.”

A typical notebook of mine had:

Data loading mixed with feature engineering

Model training buried under chart code

Half a dozen “test_2.ipynb”, “test_final.ipynb”, “final_final.ipynb” files

Zero clear entry point for anyone else to run anything

It was fine for Kaggle, where the only goal was a leaderboard score and a fancy metric.

But as soon as my work touched production, reality pushed back:

A teammate needed to rerun my experiment and couldn’t.

A bug in preprocessing silently corrupted predictions.

An “easy” change required editing code in four places.

A downstream team asked, “What happens if we log this differently?” and I had no idea.

I thought my problem was that I didn’t know enough algorithms.

The real problem was that I didn’t know how to use Python as a tool for thinking.

I was writing machine learning scripts. I wasn’t building machine learning software.

The Trigger: A Failed “Simple” Change

The wake-up call came from what was supposed to be a trivial update.

We had a model that scored user sessions for a recommendation system. Product asked a straightforward question:

“Can we add a feature for ‘sessions in the last 7 days’ and see if it helps?”

In my head: sure, that’s a 30-minute change.

In reality, this “simple” change took nearly two weeks, pulled in three people, and exposed every weakness in my Python setup:

The feature engineering lived inside a single 600-line notebook cell.

That cell depended on some global variables defined 20 cells earlier.

I had hard-coded a date filter “for convenience” three months ago and forgotten.

The training data pipeline existed only inside the notebook; nothing was reusable.

We added the new feature. The model score improved.

Then the batch job failed in production, because I’d used pd.to_datetime with utc=True in one place and without UTC in another. The keys no longer matched.

It took a day to track down. It was my code.

That incident forced a hard question:

If adding one feature can break everything, how will this ever scale?

The Build: Treating Python Like a System, Not a Notebook

I decided to treat my next machine learning project like a small software product, not a side experiment.

No new libraries. No new buzzwords. Just a new way of organizing the Python I already knew.

The project: a lead scoring model for a small sales team. Think: predict the probability that an incoming sign-up would become a paying customer.

Same ingredients as always:

pandas for data

scikit-learn for models

Some plotting for sanity checks

The difference was in how I structured the work.

1. One Entry Point, No Magic

I forced myself to have a single command anyone could run:

python -m leadscore.train --config configs/base.yaml

Inside leadscore/train.py, I did exactly three things:

Load config.

Call build_dataset(config).

Call train_model(dataset, config).

No data munging inside train.py. No model definition here. Just orchestration.

The first benefit: I could run the entire pipeline from scratch with one command. No more “run cell 12, then cell 27, but skip cell 25.”
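The three steps above can be sketched as a minimal train.py. The bodies of load_config, build_dataset, and train_model are stubbed placeholders here, not the project’s real code — the point is only the orchestration shape.

```python
# Minimal sketch of the single-entry-point pattern: train.py does
# exactly three things and nothing else. The stub bodies below are
# placeholders for the real modules.

def load_config(path):
    # placeholder: the real version parses a YAML file
    return {"config_path": path}

def build_dataset(config):
    # placeholder: the real version builds the model-ready table
    return [{"user_id": 1, "sessions_last_7_days": 3, "label": 1}]

def train_model(dataset, config):
    # placeholder: the real version fits and evaluates the model
    return {"auc": 0.75}

def main(config_path):
    # exactly three steps; no data munging, no model definition
    config = load_config(config_path)
    dataset = build_dataset(config)
    return train_model(dataset, config)

if __name__ == "__main__":
    print(main("configs/base.yaml"))
```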

2. Separate Concerns For Real

I split the code into three Python modules:

leadscore/data.py — how raw data turns into a model-ready table

leadscore/features.py — pure feature functions with tests

leadscore/model.py — model definition, training, evaluation

A feature looked like this:

def sessions_last_7_days(events_df, now):
    cutoff = now - pd.Timedelta(days=7)
    recent = events_df[events_df["timestamp"] >= cutoff]
    return (
        recent.groupby("user_id")
        .size()
        .rename("sessions_last_7_days")
        .astype("int32")
    )

Nothing fancy. But now I could:

Test this function in isolation with a tiny DataFrame.

Reuse it in training and in inference.

See exactly what “sessions_last_7_days” meant, in one place.

My old notebook had this logic scattered across four cells.
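Here is what such an isolated test might look like. The tiny DataFrame and the timestamps are invented for illustration; the function body follows the article.

```python
# Testing sessions_last_7_days in isolation with a tiny DataFrame.
import pandas as pd

def sessions_last_7_days(events_df, now):
    cutoff = now - pd.Timedelta(days=7)
    recent = events_df[events_df["timestamp"] >= cutoff]
    return (
        recent.groupby("user_id")
        .size()
        .rename("sessions_last_7_days")
        .astype("int32")
    )

now = pd.Timestamp("2023-04-11", tz="UTC")
events = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "timestamp": [now - pd.Timedelta(days=1),    # inside the window
                  now - pd.Timedelta(days=2),    # inside the window
                  now - pd.Timedelta(days=30)],  # too old, dropped
})

counts = sessions_last_7_days(events, now)
assert counts.loc[1] == 2        # user 1 has two recent sessions
assert 2 not in counts.index     # user 2 has none, so no row at all
```

Note the second assertion: a user with no recent sessions is simply absent from the result, which is exactly the kind of edge case worth pinning down in a test.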

3. Make Failure Obvious, Not Silent

I added embarrassingly simple checks:

assert not df.isna().any().any(), "NaNs found in training data"
assert df["label"].isin([0, 1]).all(), "Labels must be 0 or 1"

I versioned the schema with Pydantic models:

from pydantic import BaseModel

class TrainingRow(BaseModel):
    user_id: int
    sessions_last_7_days: int
    country: str
    label: int

Incoming data had to pass this before training.

This caught an issue where one of our sources started emitting label = 2 for a “trial expired” state. Old me would have ignored it or silently coalesced it to 1. New me had the code fail loudly, so we had to decide what it meant.
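One way to make a stray label = 2 fail loudly at the schema layer is to constrain the field with a Literal type. The article doesn’t show this exact constraint, so treat the sketch below as one possible implementation, not the project’s actual code.

```python
# Sketch: constraining label so unexpected values (like the
# "trial expired" state's label = 2) raise a ValidationError
# instead of silently slipping into training.
from typing import Literal
from pydantic import BaseModel, ValidationError

class TrainingRow(BaseModel):
    user_id: int
    sessions_last_7_days: int
    country: str
    label: Literal[0, 1]  # anything else fails loudly

# a well-formed row parses fine
ok = TrainingRow(user_id=1, sessions_last_7_days=3, country="DE", label=1)

# label = 2 is rejected, forcing a human decision about what it means
try:
    TrainingRow(user_id=2, sessions_last_7_days=0, country="US", label=2)
    rejected = False
except ValidationError:
    rejected = True
```

The failure happens at data-loading time, before any model ever sees the row — which is exactly where you want the conversation about label = 2 to start.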

4. Track Experiments Like an Adult

Until then, my “experiment tracking” was cell comments like # best so far and filenames like model_v3_good.ipynb.

For this project, I used a simple CSV log:

timestamp,config_hash,auc,precision_at_0_8_recall,notes
2023-04-11T10:32Z,8f1c9b,0.742,0.31,"baseline"
2023-04-11T11:05Z,f45a33,0.751,0.34,"+ sessions_last_7_days"

Every time train_model ran, it wrote a row. I didn’t even bother with MLflow at first; a CSV was enough.

The point wasn’t tooling. The point was: if I can’t answer “What changed between these two runs?” in one place, I’m not learning, I’m guessing.

The Realization: Python Wasn’t the Problem

Somewhere in the second week of this project, I noticed something strange.

I wasn’t Googling “pandas how to groupby” anymore. I wasn’t copy-pasting code snippets from Kaggle kernels.

I was asking different questions:

Where should this logic live so it’s testable?

If I change how we define “active user”, what breaks?

What happens if this feature is missing in production?

How do I make this obvious when someone else reads it?

The language hadn’t changed. I had.

Python stopped being my goal (“use more advanced techniques!”) and became a medium for encoding decisions.

The lead scoring model itself was not impressive:

A regularized logistic regression

AUC around 0.75

Some basic categorical encoding

A handful of boring features

But the process was solid enough that:

We could add a feature in under an hour.

We could roll back a bad experiment in seconds.

We could explain every column to the sales team.

For the first time, I felt like I was doing machine learning on purpose.

How the System Actually Works

Under the hood, the whole thing boiled down to a simple pattern I’ve reused ever since.

Input layer: small set of functions that know how to fetch raw data

Example: load_events(start_date, end_date).

Transformation layer: pure Python functions operating on DataFrames, each one doing one thing

Example: sessions_last_7_days(events_df, now).

Schema layer: Pydantic models to enforce what a “row” means.

Model layer: classes that expose three methods: fit, predict_proba, explain. Even for scikit-learn models, I wrapped them.

Experiment layer: a logger that records config + metrics.

Interface layer: a main() that wires everything together from a config file.

That’s it. No fancy orchestration tool. No auto-ML. No giant framework.
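As a sketch of that model layer: wrapping a scikit-learn estimator behind fit / predict_proba / explain might look like the class below. The class name and the explain() output format are my assumptions; the article only names the three methods.

```python
# Sketch of the model-layer wrapper: even a plain scikit-learn
# estimator is hidden behind fit / predict_proba / explain, so every
# model in the project has the same surface.
from sklearn.linear_model import LogisticRegression

class LeadScoreModel:
    """Thin wrapper giving every model the same three-method surface."""

    def __init__(self, **params):
        self._clf = LogisticRegression(**params)
        self._feature_names = None

    def fit(self, X, y, feature_names):
        self._feature_names = list(feature_names)
        self._clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # return only the positive-class probability
        return self._clf.predict_proba(X)[:, 1]

    def explain(self):
        # map each feature name to its learned coefficient
        return dict(zip(self._feature_names, self._clf.coef_[0]))
```

Swapping logistic regression for something else then means touching one class, not every call site.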

The magic wasn’t the pattern itself. It was what the pattern forced:

Deciding what each piece of code is responsible for

Writing functions that are small enough to test

Making failure loud instead of subtle

Giving every experiment an identity

Once I saw it this way, I couldn’t unsee it. Every one-off notebook after that felt painfully fragile.

Impact: Fewer “Heroics,” More Boring Wins

So what did this change, beyond my ego?

On that lead scoring project:

Time to add a new feature went from ~2–3 days to under 2 hours.

Retraining the model after a schema change dropped from a full day to ~20 minutes.

The number of “oops, we broke something” incidents went from weekly to maybe once in three months.

I’m not exaggerating those numbers for effect. I went back and checked Git history and Slack timestamps. The difference was real.

We also caught a nasty bug early:

I had defined “active user” as “any event in the last 30 days.” Our billing system defined it as “any paid invoice in the last 30 days.” That difference alone would have completely skewed our model.

Because the schema was explicit and the features were named, it took a 5-minute conversation to spot the mismatch.

Before, this is the kind of thing I would have discovered weeks later when someone says, “Why are we calling these people ‘active’ if they canceled last month?”

The best part: the sales team trusted the model because they could see the logic. We literally sat with them and walked through the feature code. It wasn’t magic. It was just Python.

What Didn’t Work (And Still Doesn’t)

I wish I could say I’ve perfectly applied this system ever since. I haven’t.

I still fall back into old habits:

Quick exploratory notebooks with messy state

“Temporary” scripts that live in production for a year

Silent assumptions about time zones and currencies

And this system has real limits:

It doesn’t solve data quality issues upstream.

It doesn’t replace good monitoring in production.

It doesn’t magically make non-technical stakeholders patient.

I also underestimated how much overhead this structure adds for truly tiny tasks. Sometimes you don’t need a schema and five layers. Sometimes you just need a 30-line script and a one-off chart.

The trade-off I’ve accepted is this:

For anything that might live longer than a week or touch real users, I treat it like a system.

For quick one-off analyses, I allow myself messy notebooks — but I don’t pretend they’re reusable.

The discipline is not free. But debugging chaos at 11:30 PM is more expensive.

A Practical Framework If You’re Still Gluing Scripts Together

If you see yourself in my “before” story, here’s a minimal way to start shifting without rewriting your life.

1. Give Every Project One Entry Point

Pick a folder and add a main.py with a main() function.

Have it do only orchestration:

def main(config_path: str):
    config = load_config(config_path)
    df = build_dataset(config)
    model, metrics = train_model(df, config)
    save_artifacts(model, metrics, config)

You’ll feel it the first time you rerun everything with one command.

2. Extract One Pure Feature Function Per Week

Take one feature you always build and turn it into a reusable function with tests.

Example:

def days_since_last_login(events_df, now):
    last_login = (
        events_df[events_df["event_type"] == "login"]
        .groupby("user_id")["timestamp"]
        .max()
    )
    return (now - last_login).dt.days.rename("days_since_last_login")

Write a tiny test with three fake users. You’ve just made one part of your pipeline honest.
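That tiny test with three fake users could look like this. The event data is invented for illustration: one recent login, one old login, and one user who never logs in at all.

```python
# A tiny three-user test for days_since_last_login.
import pandas as pd

def days_since_last_login(events_df, now):
    last_login = (
        events_df[events_df["event_type"] == "login"]
        .groupby("user_id")["timestamp"]
        .max()
    )
    return (now - last_login).dt.days.rename("days_since_last_login")

now = pd.Timestamp("2023-04-11")
events = pd.DataFrame({
    "user_id":    [1, 2, 3],
    "event_type": ["login", "login", "page_view"],  # user 3 never logs in
    "timestamp":  [now - pd.Timedelta(days=1),
                   now - pd.Timedelta(days=10),
                   now - pd.Timedelta(days=5)],
})

days = days_since_last_login(events, now)
assert days.loc[1] == 1
assert days.loc[2] == 10
assert 3 not in days.index  # no login events, so no row at all
```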

3. Write Down Your Assumptions As Code

Anywhere you’re making an assumption, make the code say it.

Instead of:

df["label"] = df["status"] == "paid"

Force the issue:

VALID_STATUSES = {"trial", "paid", "canceled"}
assert set(df["status"].unique()).issubset(VALID_STATUSES)
df["label"] = (df["status"] == "paid").astype(int)

You’re not just protecting the code. You’re communicating with your future self.

4. Log One Metric Per Run

Don’t overthink experiment tracking.

Open a file called experiments.csv and append to it every time:

log_experiment(
    config_hash=config_hash,
    auc=metrics["auc"],
    notes="baseline with no session features",
)

Even a crude log will change how you think. It forces you to treat each run as a decision, not a guess.
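For completeness, log_experiment can be little more than a csv.writer append. The article only shows the call site, so this body is an assumption — a minimal sketch, not the real implementation.

```python
# Sketch: appending one row per run to experiments.csv.
import csv
import datetime
import os

def log_experiment(config_hash, auc, notes, path="experiments.csv"):
    # write a header the first time the file is created
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "config_hash", "auc", "notes"])
        writer.writerow([
            datetime.datetime.now(datetime.timezone.utc)
                .isoformat(timespec="minutes"),
            config_hash,
            auc,
            notes,
        ])
```

No locking, no database, no dashboard — but it answers “what changed between these two runs?” in one place.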

The Identity Shift: From “Python User” to “System Builder”

The biggest change for me wasn’t technical.

It was how I thought about my role.

I stopped seeing myself as someone who “writes Python for machine learning” and started seeing myself as someone who designs systems that happen to use Python and machine learning.

That sounds subtle, but it shifts everything:

You care less about the newest library and more about failure modes.

You care less about clever one-liners and more about clear boundaries.

You care less about your favorite model and more about how it behaves when nothing goes as planned.

When my tech lead left that PR comment — “I have no idea how you think about problems” — he was telling me something uncomfortable:

I was using Python as camouflage. The code ran, but it didn’t reveal any judgment.

These days, if someone reads my project and still can’t see how I think about data, risk, and trade-offs, I consider that a bug.

If you’re feeling that same discomfort right now, good. It means you’re ready to stop gluing scripts together and start using Python the way it’s used by the people you want to work with:

Not as a bag of tricks.

As a language for decisions.
