Author: Reshama Shaikh
Introduction
Use the function check_scalar for parameter validation. For a given parameter, it checks that the value has an acceptable data type and that it falls within the allowed range of values (an interval).
- References Issue #21927 (@reshamas)
- References Issue #20724: “Use check_scalar for parameters validation” (with notes by @glemaitre, @jjerphan, @genvalen)
- References PR #20723. “MNT use check_scalar to validate scalar in AffinityPropagation”. This is an example PR by @glemaitre.
A helper function exists in scikit-learn which validates a scalar value: sklearn.utils.check_scalar (see its documentation).
It is used to validate parameters of classes (and functions). Most of the current classes in scikit-learn do not use this helper function.  We want to refactor the code so that it does use this standard helper function. Utilizing this helper function will help to get consistent error types and messages.
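To make the idea concrete, here is a minimal sketch of the kind of check such a helper performs. The function validate_scalar below is a hypothetical stand-in written for this post, not scikit-learn's implementation, and its message wording is only illustrative. It checks the type first (TypeError) and then the inclusive bounds (ValueError).

```python
import numbers


def validate_scalar(x, name, target_type, min_val=None, max_val=None):
    """Hypothetical stand-in: check the type first, then the inclusive bounds."""
    if not isinstance(x, target_type):
        raise TypeError(
            f"{name} must be an instance of {target_type}, not {type(x)}."
        )
    if min_val is not None and x < min_val:
        raise ValueError(f"{name} == {x}, must be >= {min_val}.")
    if max_val is not None and x > max_val:
        raise ValueError(f"{name} == {x}, must be <= {max_val}.")
    return x


def describe(x):
    """Return the name of the exception raised for x as max_iter, or 'ok'."""
    try:
        validate_scalar(x, "max_iter", numbers.Integral, min_val=1)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


results = {repr(v): describe(v) for v in (100, 0, "ten")}
print(results)
```

Because every parameter goes through the same function, a wrong type always produces a TypeError and an out-of-range value always produces a ValueError, with the same message layout each time.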
Steps
Below, I go through an example, step by step.
Go to working directory
pwd
▶ pwd
/Users/reshamashaikh/software-build/scikit-learn
(base) 
~/software-build/scikit-learn  main ✔    
Activate virtual environment
conda activate sklearndev
▶ conda activate sklearndev
(sklearndev) 
~/software-build/scikit-learn  main ✔    
Sync local repo with the GitHub repo, main branch
git pull upstream main
git push origin main
▶ git pull upstream main
From github.com:scikit-learn/scikit-learn
 * branch                main       -> FETCH_HEAD
Already up to date.
(sklearndev) 
~/software-build/scikit-learn  main ✔                                                                    1d  
▶ git push origin main
Everything up-to-date
(sklearndev) 
~/software-build/scikit-learn  main ✔                                                                    1d  
▶ 
Create a new working branch, from main branch
git checkout main
git checkout -b xscalar_glm
▶ git checkout main
Already on 'main'
Your branch is up to date with 'origin/main'.
(sklearndev) 
~/software-build/scikit-learn  main ✔                                                                    1d  
▶ git checkout -b xscalar_glm
Switched to a new branch 'xscalar_glm'
(sklearndev) 
~/software-build/scikit-learn  xscalar_glm ✔                                                             1d  
▶ 
Identify a class to implement check_scalar function
To find an estimator that could use the check_scalar function, I searched the scikit-learn/scikit-learn repo for max_iter, as a start. I was looking for a constructor that takes scalar numeric parameters.
I found:
- File: sklearn/linear_model/_glm/glm.py
- Associated test: sklearn/linear_model/_glm/tests/test_glm.py
Identify the scalar numeric parameters
For glm.py, I found four classes in the file:
- GeneralizedLinearRegressor
- PoissonRegressor
- GammaRegressor
- TweedieRegressor
I will begin work on the first one, GeneralizedLinearRegressor. For each scalar parameter, I will look at the minimum and maximum values. If they are missing, I will add them, as well as the boundary conditions.
Within the class GeneralizedLinearRegressor, I identify the following scalar numeric parameters:
- alpha, value range: [0.0, inf)
- max_iter, value range: [1, inf)
- tol, value range: (0.0, inf)
- verbose, value range: [0, inf)
Tests
Tests and validation
Parameter validation checks are added in order to catch any invalid parameter values passed into the estimator before the algorithm is run. If no parameter validation exists, we are left to the mercy of the algorithm. For instance, if the algorithm receives a negative number for the maximum number of iterations, it may fail with an obscure error or run without complaint.
scikit-learn has thorough validation checks. With the use of the helper function, check_scalar, these validation checks can be refactored for greater consistency and readability.
Tests are added to make sure that parameter validation checks behave correctly. In the case of creating tests for check_scalar, the tests check that the check_scalar validation raises a ValueError or a TypeError where appropriate, and that the error message returned is as expected.
If no tests exist for the parameter validation, add tests. Note that the validation itself may already exist even when the tests do not.
See if tests exist
In the file test_glm.py, I see that the following test exists. It checks 5 invalid inputs, but expects the same ValueError and the same error message for all of them:
@pytest.mark.parametrize("max_iter", ["not a number", 0, -1, 5.5, [1]])
def test_glm_max_iter_argument(max_iter):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(max_iter=max_iter)
    with pytest.raises(ValueError, match="must be a positive integer"):
        glm.fit(X, y)
In this case, these are invalid values for max_iter: ["not a number", 0, -1, 5.5, [1]]
- “not a number”: invalid type (string), should be integer
- 5.5: invalid type (float), should be integer
- [1]: invalid type (list), should be integer
- 0: iterations should be > 0
- -1: iterations should be > 0
So, here we have 5 tests to run. And, our tests should give informative error messages.
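The split above (wrong type versus right type but out of range) can be sketched in a few lines. This is only an illustration of the reasoning, assuming the rule "not an integer gives TypeError, an integer below 1 gives ValueError"; expected_error is a hypothetical helper, not part of scikit-learn.

```python
import numbers

invalid_values = ["not a number", 0, -1, 5.5, [1]]


def expected_error(value):
    """Wrong type maps to TypeError; an integer below 1 maps to ValueError."""
    if not isinstance(value, numbers.Integral):
        return TypeError
    if value < 1:
        return ValueError
    return None  # the value would be valid


expected = [(value, expected_error(value)) for value in invalid_values]
for value, error in expected:
    print(f"max_iter={value!r}: {error.__name__}")
```

Three of the five inputs are type problems and two are range problems, which is why a single ValueError message cannot describe all of them well.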
In the glm.py file, I temporarily comment out whatever checks exist for valid values (validation) of max_iter.
        # if not isinstance(self.max_iter, numbers.Integral) or self.max_iter <= 0:
        #     raise ValueError(
        #         "Maximum number of iteration must be a positive "
        #         "integer;"
        #         " got (max_iter={0!r})".format(self.max_iter)
        #     )
Then, I run the existing test test_glm_max_iter_argument:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_max_iter_argument -vsl
I see that 5 tests have failed:
- max_iter='not a number':
    >       if n_iterations >= maxiter:
    E       TypeError: '>=' not supported between instances of 'int' and 'str'
    ../../miniforge3/envs/sklearndev/lib/python3.9/site-packages/scipy/optimize/lbfgsb.py:367: TypeError
- max_iter=0:
    >       glm.fit(X, y)
    E       Failed: DID NOT RAISE <class 'ValueError'>
    sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
- max_iter=-1:
    >       glm.fit(X, y)
    E       Failed: DID NOT RAISE <class 'ValueError'>
    sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
- max_iter=5.5:
    >       glm.fit(X, y)
    E       Failed: DID NOT RAISE <class 'ValueError'>
    sklearn/linear_model/_glm/tests/test_glm.py:150: Failed
- max_iter=[1]:
    >       if n_iterations >= maxiter:
    E       TypeError: '>=' not supported between instances of 'int' and 'list'
    ../../miniforge3/envs/sklearndev/lib/python3.9/site-packages/scipy/optimize/lbfgsb.py:367: TypeError
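These failures are consistent with a bare comparison being the first thing that touches max_iter once the validation is gone. The sketch below mimics only the `n_iterations >= maxiter` line from the traceback; it is a simplification for illustration, and compare_like_solver is a hypothetical name, not SciPy code.

```python
def compare_like_solver(max_iter, n_iterations=1):
    """Mimic the bare `n_iterations >= maxiter` comparison from the traceback."""
    try:
        return n_iterations >= max_iter
    except TypeError as exc:
        return f"TypeError: {exc}"


outcomes = {
    repr(value): compare_like_solver(value)
    for value in ["not a number", 0, -1, 5.5, [1]]
}
for value, outcome in outcomes.items():
    print(f"max_iter={value}: {outcome}")
```

The string and the list fail with a message about `>=`, which says nothing about max_iter. The values 0, -1 and 5.5 compare without error, so in this sketch nothing is raised for them at all.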
Add parametrized tests
The tests must fail before adding validation. This is an example of how we will add a parametrized test:
Current:
@pytest.mark.parametrize("max_iter", ["not a number", 0, -1, 5.5, [1]])
def test_glm_max_iter_argument(max_iter):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(max_iter=max_iter)
    with pytest.raises(ValueError, match="must be a positive integer"):
        glm.fit(X, y)
We will update the test as follows, so that each invalid value has its own expected error type and message:
@pytest.mark.parametrize(
    "params, err_type, err_msg",
    [
        ({"max_iter": 0}, ValueError, "max_iter == 0, must be >= 1"),
        ({"max_iter": -1}, ValueError, "max_iter == -1, must be >= 1"),
        (
            {"max_iter": "not a number"},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>, not <class"
            " 'str'>",
        ),
        (
            {"max_iter": [1]},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>,"
            " not <class 'list'>",
        ),
        (
            {"max_iter": 5.5},
            TypeError,
            "max_iter must be an instance of <class 'numbers.Integral'>,"
            " not <class 'float'>",
        ),
    ],
)
def test_glm_scalar_argument(params, err_type, err_msg):
    """Test GLM for invalid max_iter argument."""
    y = np.array([1, 2])
    X = np.array([[1], [2]])
    glm = GeneralizedLinearRegressor(**params)
    with pytest.raises(err_type, match=err_msg):
        glm.fit(X, y)
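A note on how the expected messages are compared: pytest.raises(..., match=...) applies re.search to the string form of the exception, so err_msg is treated as a regular expression and only needs to match part of the message. The sketch below shows that behaviour with the re module alone; the strings are hypothetical examples, not output copied from scikit-learn.

```python
import re

# Hypothetical text of a raised exception, used only for illustration.
raised_msg = "max_iter == 0, must be >= 1."

# The full expected message is found inside the raised message.
full_match = re.search("max_iter == 0, must be >= 1", raised_msg) is not None

# A fragment is enough, because re.search does not require a full match.
fragment_match = re.search("must be >= 1", raised_msg) is not None

# A message for a different value does not match.
wrong_value = re.search("max_iter == -1, must be >= 1", raised_msg) is not None

# Caveat: '.' is a regex metacharacter, so an unescaped pattern is looser
# than it looks. re.escape makes the comparison literal.
pattern = "<class 'numbers.Integral'>"
loose = re.search(pattern, "<class 'numbersXIntegral'>") is not None
strict = re.search(re.escape(pattern), "<class 'numbersXIntegral'>") is not None

print(full_match, fragment_match, wrong_value, loose, strict)
```

For the messages in this test the looseness is harmless, but it is worth remembering when an expected message contains characters such as parentheses or brackets.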
I run the tests.
Note: I have renamed the test function.
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument
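The tests fail, as expected: the old validation is still in place, so every invalid value raises a ValueError with the old message, which matches neither the new error types nor the new messages.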
E           ValueError: Maximum number of iteration must be a positive integer; got (max_iter=5.5)
sklearn/linear_model/_glm/glm.py:232: ValueError
==================================================== 5 failed in 0.59s =====================================================
(sklearndev) 
Add and run validation
Next, in the glm.py file, I do two things:
- Import the needed function
    from ...utils import check_scalar
- Add the check_scalar function in the def fit function. The function here checks that max_iter:
    - is an integer
    - has a minimum value of 1
    - has no maximum value
    - is within this range: [1, inf). Note that no upper bound is specified.
        check_scalar(
            self.max_iter,
            name="max_iter",
            target_type=numbers.Integral,
            min_val=1,
            max_val=None,
            include_boundaries="left",
        )
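The include_boundaries argument decides which ends of the interval are allowed; it accepts "left", "right", "both" or "neither". Here is a small sketch of those semantics. The helper in_interval is hypothetical and only models the interval logic, not check_scalar's type checking or error messages.

```python
import operator

# (lower-bound test, upper-bound test) for each include_boundaries mode.
_COMPARATORS = {
    "left": (operator.ge, operator.lt),     # [min_val, max_val)
    "right": (operator.gt, operator.le),    # (min_val, max_val]
    "both": (operator.ge, operator.le),     # [min_val, max_val]
    "neither": (operator.gt, operator.lt),  # (min_val, max_val)
}


def in_interval(x, min_val=None, max_val=None, include_boundaries="both"):
    """Return True if x lies in the interval described by the arguments."""
    lower_ok, upper_ok = _COMPARATORS[include_boundaries]
    if min_val is not None and not lower_ok(x, min_val):
        return False
    if max_val is not None and not upper_ok(x, max_val):
        return False
    return True


# max_iter in [1, inf): 1 is allowed, 0 is not.
max_iter_checks = [in_interval(v, min_val=1, include_boundaries="left") for v in (1, 0)]

# alpha in [0.0, inf) allows 0.0; tol in (0.0, inf) does not.
alpha_allows_zero = in_interval(0.0, min_val=0.0, include_boundaries="left")
tol_allows_zero = in_interval(0.0, min_val=0.0, include_boundaries="neither")

print(max_iter_checks, alpha_allows_zero, tol_allows_zero)
```

With max_val=None only the lower end matters, so "left" here simply means that the value 1 itself is accepted.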
Confirm tests are passing!
After doing the above, we see that all 5 tests are now passing:
~/software-build/scikit-learn  xscalar_glm ✔                                                                                           8d  
▶ pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
=========================================================== test session starts ============================================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/reshamashaikh/miniforge3/envs/sklearndev/bin/python
cachedir: .pytest_cache
rootdir: /Users/reshamashaikh/software-build/scikit-learn, configfile: setup.cfg
plugins: cov-3.0.0
collected 78 items / 73 deselected / 5 selected                                                                                            
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params0-ValueError-max_iter == 0, must be >= 1] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params1-ValueError-max_iter == -1, must be >= 1] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params2-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'str'>] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params3-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'list'>] PASSED
sklearn/linear_model/_glm/tests/test_glm.py::test_glm_scalar_argument[params4-TypeError-max_iter must be an instance of <class 'numbers.Integral'>, not <class 'float'>] PASSED
===================================================== 5 passed, 73 deselected in 0.23s =====================================================
(sklearndev) 
Reminders
When submitting the pull request (PR):
- Label PR with prefix “MAINT”
- A changelog entry is not required
Resources
Rebuild source code
If tests are failing, I may need to rebuild the source code, using one of the commands below:
pip install -e . --no-build-isolation -v
or
python setup.py clean
pip install --verbose --no-build-isolation --editable .
Run full test suite in sklearn
Running the full suite of tests takes about 20 minutes on my computer.
pytest sklearn
There is example output of the tests in 2021-12-12-pytest_sklearn_output.md
E       AssertionError: 
E         This test fails because scikit-learn has been built without OpenMP.
E         This is not recommended since some estimators will run in sequential
E         mode instead of leveraging thread-based parallelism.
E         
E         You can find instructions to build scikit-learn with OpenMP at this
E         address:
E         
E             https://scikit-learn.org/dev/developers/advanced_installation.html
E         
E         You can skip this test by setting the environment variable
E         SKLEARN_SKIP_OPENMP_TEST to any value.
E         
E       assert False
E        +  where False = _openmp_parallelism_enabled()
sklearn/tests/test_build.py:33: AssertionError
===== 1 failed, 25839 passed, 205 skipped, 250 xfailed, 62 xpassed, 2290 warnings in 1002.24s (0:16:42) ======
(sklearndev) 
~/software-build/scikit-learn  xscalar_glm ✔  
Running Individual Tests
Typically, to run the full test suite, I would type pytest sklearn, which takes about 20 minutes.
Individual tests can be run much faster. There are a couple of ways to do it:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_max_iter_argument -vsl
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
This is the output observed after running the test.
▶ pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
=================================================== test session starts ====================================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/reshamashaikh/software-build/scikit-learn, configfile: setup.cfg
plugins: cov-3.0.0
collected 5 items                                                                                                          
sklearn/linear_model/_glm/tests/test_glm.py .....                                                                    [100%]
==================================================== 5 passed in 0.17s =====================================================
(sklearndev) 
~/software-build/scikit-learn  xscalar_glm ✔  
Because I consolidated some existing tests and added the new ones, I renamed the test. I would run the following for the test:
pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
Acknowledgements
- Guillaume Lemaitre @glemaitre
- Julien Jerphanon @jjerphan
- Thomas J. Fan @thomasjpfan
- Genesis Valencia @genvalen
Part 2: PoissonRegressor
- Virtual environment activated:  conda activate sklearndev
- Identify class to work on:  PoissonRegressor
- Working with this file: sklearn/linear_model/_glm/glm.py
- Working with associated test: sklearn/linear_model/_glm/tests/test_glm.py
- Create working branch from main branch:
    git checkout main
    git pull upstream main
    git checkout -b xscalar_poissonreg
- Identify scalar numerical parameters and the valid range of values for the class PoissonRegressor:
    - alpha, value range: [0.0, inf)
    - max_iter, value range: [1, inf)
    - tol, value range: (0.0, inf)
    - verbose, value range: [0, inf)
- Add parameter interval ranges to the docstring:
    - alpha: Values should be in the range [0.0, inf).
    - max_iter: Values should be in the range [1, inf).
    - tol: Values should be in the range (0.0, inf).
    - verbose: Values should be in the range [0, inf).
- Run tests:  pytest sklearn/linear_model/_glm/tests/test_glm.py -k test_glm_scalar_argument -vsl
- There is no def fit for class PoissonRegressor; it inherits fit from GeneralizedLinearRegressor.
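A class without its own def fit still gets validation if it inherits fit from a parent that validates. The sketch below shows this with two toy classes. ParentRegressor and ChildRegressor are hypothetical names, and the assumption is that PoissonRegressor relates to GeneralizedLinearRegressor in the same way.

```python
import numbers


class ParentRegressor:
    """Toy parent whose fit validates max_iter before doing any work."""

    def __init__(self, max_iter=100):
        self.max_iter = max_iter

    def fit(self, X, y):
        if not isinstance(self.max_iter, numbers.Integral):
            raise TypeError("max_iter must be an integer.")
        if self.max_iter < 1:
            raise ValueError(f"max_iter == {self.max_iter}, must be >= 1.")
        return self  # a real estimator would run the solver here


class ChildRegressor(ParentRegressor):
    """Toy child that defines no fit of its own."""


# The child class body has no fit; lookup falls through to the parent.
child_defines_fit = "fit" in vars(ChildRegressor)
same_function = ChildRegressor.fit is ParentRegressor.fit

try:
    ChildRegressor(max_iter=0).fit([[1], [2]], [1, 2])
    child_error = None
except ValueError as exc:
    child_error = str(exc)

print(child_defines_fit, same_function, child_error)
```

Under that assumption, the scalar checks added to the parent's fit also cover the child classes, and the existing tests can be parametrized over the child estimators rather than duplicated.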
 
             
         
       
      