Documentation is imperative for project sustainability, yet often overlooked. Millions of NumPy stakeholders will benefit from this initiative. Melissa, Mars and Ralf, thank you for taking a lead on this!
Authors: Ralf Gommers, Melissa Mendonça, Mars Lee
Status: Draft
Type: Process
Created: 2020-02-11
This document proposes a restructuring of the NumPy Documentation, both in form and content, with the goal of making it more organized and discoverable for beginners and experienced users.
Motivation and Scope
See [here](numpy.org/devdocs) for the front page of the latest docs. The organization is quite confusing and illogical (e.g. user and developer docs are mixed). We propose the following:
- Reorganizing the docs into the four categories described in this proposal (tutorials, how-to guides, explanations, and reference guide);
- Creating dedicated sections for Tutorials and How-Tos, including guidance on how to create new content;
- Adding an Explanations section for key concepts and techniques that require deeper descriptions, some of which will be rearranged from the Reference Guide.
Usage and Impact
The documentation is a fundamental part of any software project, especially open source projects. In the case of NumPy, many beginners might feel demotivated by the current structure of the documentation, since it is difficult to discover what to learn (unless the user has a clear view of what to look for in the Reference docs, which is not always the case).
Looking at the results of a “NumPy Tutorial” search on any search engine also gives an idea of the demand for this kind of content. Having official high-level documentation written using up-to-date content and techniques will certainly mean more users (and developers/contributors) are involved in the NumPy community.
The restructuring will effectively demand a complete rewrite of links and some of the current content. Input from the community will be useful for identifying key links and pages that should not be broken.
As discussed in the referenced article, there are four categories of doc content:
- Tutorials
- How-to guides
- Explanations
- Reference guide
We propose to use those categories, for both writing and reviewing, whenever a new documentation section is added.
The reasoning is that this makes it clearer, both for developers/documentation writers and for users, where each piece of information should go, and what the scope and tone of each document should be. For example, if explanations are mixed with basic tutorials, beginners might be overwhelmed and alienated. On the other hand, if the reference guide contains basic how-tos, it might be difficult for experienced users to quickly find the information they need.
Currently, there are many blogs and tutorials on the internet about NumPy or using NumPy. One issue with this is that users who search for this information may land on an outdated (unofficial) tutorial before they find the current official documentation, and so end up learning from content that is confusing, especially for beginners. Having a better infrastructure for the documentation also aims to solve this problem by giving users high-level, up-to-date official documentation that can be easily updated.
* Reference guide
NumPy has a fairly complete reference guide. All functions are documented, most have examples, and most are well cross-linked through See Also sections. Further improving the reference guide is incremental work that can be done (and is being done) by many people. There are, however, many explanations in the reference guide. These can be moved to a dedicated Explanations section of the docs.
* How-to guides
NumPy does not have many how-tos. The subclassing and array ducktyping section may be an example of a how-to. Others that could be added are:
- Parallelization (controlling BLAS multithreading with threadpoolctl, using multiprocessing, random number generation, etc.)
- Storing and loading data (.npy/.npz format, text formats, Zarr, HDF5, Bloscpack, etc.)
- Performance (memory layout, profiling, use with Numba, Cython, or Pythran)
- Writing generic code that works with NumPy, Dask, CuPy, pydata/sparse, etc.
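As an illustration of the scale a "Storing and loading data" how-to might have, here is a minimal sketch of a round trip through the .npy and .npz formats. The array names, values, and file names are made up for this example; only `np.save`, `np.savez`, and `np.load` are NumPy's own functions.

```python
import tempfile
from pathlib import Path

import numpy as np

# Illustrative arrays; names and values are invented for this sketch.
temperatures = np.array([14.2, 15.1, 13.8])
station_ids = np.array([101, 102, 103])

with tempfile.TemporaryDirectory() as tmp:
    single_path = Path(tmp) / "temperatures.npy"
    archive_path = Path(tmp) / "stations.npz"

    # .npy stores a single array together with its dtype and shape.
    np.save(single_path, temperatures)
    restored = np.load(single_path)

    # .npz bundles several arrays into one file, each under a keyword name.
    np.savez(archive_path, temperatures=temperatures, station_ids=station_ids)
    with np.load(archive_path) as archive:
        archive_keys = sorted(archive.files)
        restored_ids = archive["station_ids"]
```

A how-to of this kind would stay task-focused: it shows the steps for one concrete goal and leaves the discussion of the file format itself to the Explanations or Reference sections.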
* Explanations
There is a reasonable amount of content on fundamental NumPy concepts such as indexing, vectorization, broadcasting, (g)ufuncs, and dtypes. This could be organized better and clarified to ensure it is really about explaining the concepts and not mixed with tutorial-like or how-to-like content.
There are few explanations about anything other than those fundamental NumPy concepts.
Some examples of concepts that could be expanded:
- Copies vs. Views;
- BLAS and other linear algebra libraries;
- Fancy indexing.
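To make the first and last items concrete, the following short sketch shows the kind of behavior an explanation of copies, views, and fancy indexing would need to cover. It is an illustration only, with variable names chosen for this example.

```python
import numpy as np

a = np.arange(6)            # array([0, 1, 2, 3, 4, 5])

# Basic slicing returns a view: it shares memory with `a`.
view = a[1:4]
view[0] = 100               # this also changes a[1]

# Fancy (integer-array) indexing returns a copy with its own memory.
picked = a[[0, 2, 4]]
picked[0] = -1              # `a` is left untouched

shares_with_view = np.shares_memory(a, view)
shares_with_picked = np.shares_memory(a, picked)
```

After this runs, `a[1]` is 100 while `a[0]` is still 0, which is exactly the sort of surprise a dedicated explanation can prepare readers for.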
In addition, there are many explanations in the Reference Guide, which should be moved to this new dedicated Explanations section.
* Tutorials
There is a lot of scope for writing better tutorials. We have a new NumPy for absolute beginners tutorial (GSoD project of Anne Bonner). In addition, we need a number of tutorials addressing different levels of experience with Python and NumPy. This could be done using engaging data sets, ideas, or stories. For example, curve fitting with polynomials and functions in numpy.linalg could be done with the Keeling curve (decades' worth of measurements of CO2 concentration in air) rather than with synthetic random data.
Ideas for tutorials (these capture the types of things that make sense, they’re not necessarily the exact topics we propose to implement):
- Conway’s game of life with only NumPy (note: already in Nicolas Rougier’s book)
- Using masked arrays to deal with missing data in time series measurements
- Using Fourier transforms to analyze the Keeling curve data, and extrapolate it
- Geospatial data (e.g. lat/lon/time to create maps for every year via a stacked array, like gridMet data)
- Using text data and dtypes (e.g. use speeches from different people, shape (n_speech, n_sentences, n_words))
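As a rough sketch of how the masked-array idea could start, the snippet below masks missing readings in a short series and computes statistics over the valid values only. The readings and the `-999.0` sentinel for "missing" are assumptions made up for this illustration; a real tutorial would use an actual measurement series.

```python
import numpy as np

# Hypothetical readings; -999.0 is an assumed sentinel meaning "missing".
readings = np.array([12.0, 13.0, -999.0, 15.0, -999.0, 16.0])

# Mask every entry equal to the sentinel value.
masked = np.ma.masked_values(readings, -999.0)

# Reductions on a masked array skip the masked entries.
mean_valid = masked.mean()
n_missing = np.ma.count_masked(masked)

# Replace the masked entries with the mean of the valid ones.
filled = masked.filled(mean_valid)
```

Here the mean is taken over the four valid readings only, so the sentinel values never distort the result.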
The Preparing to Teach document from the Software Carpentry Instructor Training materials is a nice summary of how to write effective lesson plans (and tutorials would be very similar). In addition to adding new tutorials, we also propose a “How to write a tutorial” document, which would help users contribute new high-quality content to the documentation.
Using interesting data in the NumPy docs requires giving all users access to that data, either inside NumPy or in a separate package. The former is not the best idea, since it’s hard to do without increasing the size of NumPy significantly. Even for SciPy there has so far been no consensus on this (see scipy PR 8707 on adding a new scipy.datasets subpackage).
So we’ll aim for a new (pure Python) package, named numpy-datasets or scipy-datasets or something similar. That package can take some lessons from how, e.g., scikit-learn ships data sets. Small data sets can be included in the repo; large data sets can be accessed via a downloader class or function.
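One possible shape for such a downloader function is sketched below. Everything in it is hypothetical: `load_dataset`, its parameters, and the on-disk cache layout are assumptions made for illustration, not an API of any existing package. The actual download step is passed in as a `fetch` callable so the sketch runs without network access.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_dataset(name, cache_dir, fetch):
    """Return the array for `name`, fetching it only if it is not cached.

    Hypothetical helper: `fetch` is a caller-supplied function that takes
    the data set name and returns a NumPy array (e.g. by downloading it).
    """
    cache_path = Path(cache_dir) / f"{name}.npy"
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(cache_path, fetch(name))
    return np.load(cache_path)


# Stand-in for a real downloader; it records how often it is called.
fetch_calls = []


def fake_fetch(name):
    fetch_calls.append(name)
    return np.arange(5)


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_dataset("demo", cache_dir, fake_fetch)
    second = load_dataset("demo", cache_dir, fake_fetch)  # served from cache
```

Caching on first use keeps the package itself small while still letting tutorials call one function to get their data.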
These projects make the intended audience for each part of the documentation more explicit, as well as previewing some of the content in each section.
Besides rewriting the current documentation to some extent, it would be ideal to have a technical infrastructure that would allow more contributions from the community. For example, if Jupyter Notebooks could be submitted as-is as tutorials or How-Tos, this might create more contributors and broaden the NumPy community.
Similarly, if people could download some of the documentation in Notebook format, this would certainly mean people would rely less on outdated material for learning NumPy.
It would also be interesting if the new structure for the documentation makes translations easier.
Currently, the documentation for NumPy can be confusing, especially for beginners. Our proposal is to reorganize the docs in the following structure:
* For users:
  - Absolute Beginners Tutorial
  - main Tutorials section
  - How To’s for common tasks with NumPy
  - Reference Guide
  - Explanations
  - F2Py Guide
  - Glossary
* For developers/contributors:
  - Contributor’s Guide
  - Building and extending the documentation
  - Benchmarking
  - NumPy Enhancement Proposals
* Meta information
  - Reporting bugs
  - Release Notes
  - About NumPy
  - License