Functions Run Everything / Exploring cuML
"Functions run everything" — Prof. Rushan Ziatdinov. Data powers functions.
Or so I'd like to tell you, from my data-centric point of view. Frankly, it might just be my bias ¯\_(ツ)_/¯, but I believe data is the true catalyst that makes anything work (at least in ML).
From a machine learning perspective, data is the manpower behind every model. Without it, there are no models. No models = no predictions. The hunger for data is so intense that there are over 5,000 active data centers in the USA alone (Cloudscene).
It's no joke that massive corporations spend millions upon MILLIONS storing and crunching data. More data = more signal to learn from.
As we know, a model's complexity only shines when there's enough data to feed it. Otherwise, volume beats sophistication: a simple model trained on lots of data will often outperform a fancy one starved of it.
So... what's this even about?
Did I just waste your time? You already know data is useful. So why am I telling you this?
Because with loads of data everywhere, it needs to be processed faster; we need more efficient ways to crunch it. Today, most people reach for GPUs through deep learning frameworks like TensorFlow or Keras, bringing in sophisticated models of all sorts. But let's step back to what we already know: why not a GPU-powered scikit-learn? Training "simple" models on massive volumes of data at blazing speed? Isn't that the best of both worlds?
Enter cuML by NVIDIA
cuML is part of NVIDIA's RAPIDS suite: a set of open-source, GPU-accelerated libraries built to supercharge data science and machine learning workflows. It ships with GPU-optimized implementations of popular ML algorithms such as Random Forests, k-Nearest Neighbors, PCA, and k-Means.
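You can also call cuML's estimators directly; the API deliberately mirrors scikit-learn's. Here's a minimal sketch (assuming an NVIDIA GPU and cuML installed; the toy data is my own) using its k-Means implementation:
from cuml.cluster import KMeans  # GPU counterpart of sklearn.cluster.KMeans
import numpy as np

X = np.random.rand(100_000, 16).astype(np.float32)  # toy data; float32 keeps GPU memory low
km = KMeans(n_clusters=8)  # same constructor arguments as scikit-learn
km.fit(X)        # training runs on the GPU
labels = km.predict(X)  # so does inference
If you already know scikit-learn, there's essentially nothing new to learn here. But the even lazier path is the one below.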
%load_ext cuml.accel
# Your existing scikit-learn code here
The craziest thing is that it's "Zero Code Change" acceleration.
As their docs show, just by adding the magic command above at the top of your existing scikit-learn notebook, calls like .fit() and .predict() seamlessly offload to the GPU.
%load_ext cuml.accel
# Certain operations in common ML libraries (sklearn, umap, hdbscan)
# are now GPU accelerated
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
X, y = make_regression(n_samples=1_000_000)
model = ElasticNet()
model.fit(X, y) # runs on GPU!
model.predict(X) # runs on GPU!
Source: RAPIDS cuML Acceleration Documentation
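And you're not limited to notebooks. The docs also describe enabling the accelerator from plain Python scripts; roughly like this (a sketch, assuming a recent RAPIDS release):
# Option 1: run an unmodified scikit-learn script through the accelerator
#   python -m cuml.accel your_sklearn_script.py
# Option 2: install it programmatically, before importing sklearn
import cuml.accel
cuml.accel.install()  # must run before the sklearn imports

from sklearn.cluster import KMeans  # now transparently GPU-backed where supported
Per the docs, anything cuML can't accelerate falls back to the regular CPU implementation, so your script still runs end to end.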
If that isn't cool, I don't know what is. In a world where data volume keeps climbing, this kind of acceleration isn't just "nice to have"; it's raising the bar!
Looking Ahead
But let's think even further ahead. What about specialized accelerators like TPUs and NPUs? What about running classical ML frameworks on those, the way deep learning frameworks already do? Would we even have enough data to feed them by then? Many questions, no easy answers ~(˘▾˘~)
Stay Curious