Investigate discrepancy between non-BLAS and BLAS versions of `linfa-pls`
YuhanLiin opened this issue · 6 comments
According to benchmark results from this comment, linfa-pls is slightly slower without BLAS. Specifically, at a sample size of 100,000 the Regression-Nipals benchmarks are slightly slower, and the Regression-Svd and Canonical-Svd benchmarks are slower.
Flamegraphs from the profiling runs are attached. I haven't studied profiling or flamegraphs enough to build a firm foundation yet, so I can't provide any insights. Each profile was run for 1 min.
Linfa_pls.zip
par_azip can be used in place of the serial zip in our code, since this is where a significant amount of time is spent. Regression-Nipals-5feats-100_000 would benefit from the above; a sketch follows, and the screenshots are attached with a black border around the area of interest.
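A minimal sketch of that swap (assuming ndarray is built with its rayon feature; the arrays here are illustrative): par_azip! accepts the same pattern syntax as azip!, so parallelizing an element-wise loop is a one-word change.

use ndarray::{azip, par_azip, Array1};

fn main() {
    let b = Array1::<f64>::ones(100_000);
    let c = Array1::<f64>::ones(100_000);
    let mut a = Array1::<f64>::zeros(100_000);

    // Serial element-wise loop:
    azip!((a in &mut a, &b in &b, &c in &c) *a = b * c);

    // Parallel version; identical syntax, work split across threads:
    par_azip!((a in &mut a, &b in &b, &c in &c) *a = b * c);
}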
Similarly, for Regression-Svd it appears that par_azip should be added to linfa_linalg, per this call to ndarray::linalg::impl_linalg::general_mat_vec_mul_impl, which leads to a `ndarray::zip`... A sketch of parallelizing a matrix-vector product this way follows.
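A hedged sketch of that parallelization (hand-rolled and illustrative, not linfa_linalg's actual internals; assumes ndarray's rayon feature): each element of a matrix-vector product is an independent dot product, so the rows can be split across threads with Zip::par_for_each.

use ndarray::{Array1, ArrayView1, ArrayView2, Zip};

fn par_mat_vec(a: ArrayView2<f64>, x: ArrayView1<f64>) -> Array1<f64> {
    let mut y = Array1::zeros(a.nrows());
    // Pair each output element with the corresponding row of `a`
    // and compute the dot products in parallel.
    Zip::from(&mut y)
        .and(a.rows())
        .par_for_each(|yi, row| *yi = row.dot(&x));
    y
}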
linfa_pls::utils::outer spends a significant amount of time in ndarray::zip::Zip<P,D>::inner; there may be a faster alternative? One option is sketched below.
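One possible alternative, sketched under assumptions (I haven't checked how outer is actually implemented, so outer_zip below is only a plausible stand-in): express the outer product as an (n, 1) × (1, m) matrix multiply so it goes through ndarray's optimized matmul path instead of a generic Zip loop.

use ndarray::{Array2, ArrayView1, Axis, Zip};

// Plausible stand-in for a Zip-based outer product.
fn outer_zip(a: ArrayView1<f64>, b: ArrayView1<f64>) -> Array2<f64> {
    let mut out = Array2::zeros((a.len(), b.len()));
    Zip::indexed(&mut out).for_each(|(i, j), o| *o = a[i] * b[j]);
    out
}

// Alternative: reshape the vectors and use the matmul kernel.
fn outer_dot(a: ArrayView1<f64>, b: ArrayView1<f64>) -> Array2<f64> {
    let col = a.insert_axis(Axis(1)); // shape (n, 1)
    let row = b.insert_axis(Axis(0)); // shape (1, m)
    col.dot(&row) // (n, 1) x (1, m) = (n, m)
}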
Canonical-Svd looks like it would benefit from optimizing the linfa::param_guard::<impl linfa::traits::Fit<R,T,E> for P>::fit method.
Our param_guard code:
/// Performs checking step and calls `fit` on the checked hyperparameters. If checking failed, the
/// checking error is converted to the original error type of `Fit` and returned.
impl<R: Records, T, E, P: ParamGuard> Fit<R, T, E> for P
where
    P::Checked: Fit<R, T, E>,
    E: Error + From<crate::error::Error> + From<P::Error>,
{
    type Object = <<P as ParamGuard>::Checked as Fit<R, T, E>>::Object;

    fn fit(&self, dataset: &crate::DatasetBase<R, T>) -> Result<Self::Object, E> {
        let checked = self.check_ref()?;
        checked.fit(dataset)
    }
}
and its implementation for PLS (this lives inside a macro, hence the [<Pls $name Params>] paste-style identifiers):
impl<F: Float> ParamGuard for [<Pls $name Params>]<F> {
    type Checked = [<Pls $name ValidParams>]<F>;
    type Error = PlsError;

    fn check_ref(&self) -> Result<&Self::Checked, Self::Error> {
        if self.0.0.tolerance.is_negative()
            || self.0.0.tolerance.is_nan()
            || self.0.0.tolerance.is_infinite()
        {
            Err(PlsError::InvalidTolerance(self.0.0.tolerance.to_f32().unwrap()))
        } else if self.0.0.max_iter == 0 {
            Err(PlsError::ZeroMaxIter)
        } else {
            Ok(&self.0)
        }
    }

    fn check(self) -> Result<Self::Checked, Self::Error> {
        self.check_ref()?;
        Ok(self.0)
    }
}
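To make the flamegraph question below concrete, here is a minimal self-contained model of that call path (hypothetical names and a toy workload, not linfa's actual API): the unchecked params' fit runs a cheap validation, then delegates, so the wrapper's frame sits above and includes all the time spent in the real fitting code.

struct PlsParams(PlsValidParams);
struct PlsValidParams { max_iter: usize }

impl PlsParams {
    fn check_ref(&self) -> Result<&PlsValidParams, String> {
        if self.0.max_iter == 0 {
            Err("max_iter must be nonzero".to_string())
        } else {
            Ok(&self.0)
        }
    }

    // Mirrors the blanket `Fit` impl: validate (cheap), then delegate.
    fn fit(&self, data: &[f64]) -> Result<f64, String> {
        let checked = self.check_ref()?;
        Ok(checked.fit(data)) // virtually all CPU time lands here
    }
}

impl PlsValidParams {
    fn fit(&self, data: &[f64]) -> f64 {
        // Stand-in for the expensive numeric work.
        (0..self.max_iter).map(|_| data.iter().sum::<f64>()).sum()
    }
}

fn main() {
    let params = PlsParams(PlsValidParams { max_iter: 1000 });
    println!("{}", params.fit(&[1.0, 2.0, 3.0]).unwrap());
}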
The ParamGuard code only validates the hyperparameters, so it shouldn't be the bottleneck.
Hmm, did I interpret the flamegraph wrong? I thought that since it was at the top, most of the CPU time was spent there. Or I guess it's possible that it's saying most of the time is spent in the ::fit method implementation for ParamGuard?
That's what I think is happening; I have to look at it to make sure. All I know is that there's no way the ParamGuard code itself can be the bottleneck: its frame is wide only because a flamegraph frame includes the time of everything it calls, and here that's checked.fit.