In some applications, e.g. modelling something defined by a PDE, you might need to solve a very large linear system of equations Ax = b, where the vector b has 100,000,000 elements and A is a 100,000,000 by 100,000,000 matrix. A naive dense representation of A using 1 byte per entry would therefore need around 10^16 bytes (roughly 9 pebibytes) of storage. In practice, for many interesting systems of equations that encode "local" properties, A will be a sparse matrix where the overwhelming majority of entries are zero. There are various sparse matrix formats designed to avoid storing zeroes: the simplest is COO, where you store a triple (i, j, A_ij) for each nonzero entry A_ij of A. More efficient formats are CSR and CSC (compressed sparse row and column, respectively), which use even less memory by replacing the per-entry row or column indices with one pointer per row or column rather than repeating indices for every nonzero entry.
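To make the formats concrete, here is a minimal sketch using scipy.sparse (my choice of library for illustration, not something the above assumes), with a tiny 4x4 matrix standing in for the huge PDE operator:

```python
# Minimal sketch of the COO and CSR formats using scipy.sparse.
import numpy as np
import scipy.sparse as sp

# COO: three parallel arrays of (row, column, value) triples, one per nonzero.
rows = np.array([0, 1, 2, 2, 3])
cols = np.array([0, 1, 1, 2, 3])
vals = np.array([4.0, 3.0, -1.0, 5.0, 2.0])
A_coo = sp.coo_matrix((vals, (rows, cols)), shape=(4, 4))

# CSR: keeps the column indices and values, but replaces the per-entry row
# indices with one cumulative "row pointer" per row.
A_csr = A_coo.tocsr()

print(A_coo.toarray())             # dense view, only sensible for tiny examples
print(A_csr.indptr)                # row pointers: [0 1 2 4 5]
print(A_csr.indices, A_csr.data)   # column indices and nonzero values
```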
To solve one of these huge linear systems Ax = b, we hardly ever want [1] to go to the expense of computing and storing an explicit matrix representation of the inverse A^{-1}. Instead, we merely need to have some function f that is able to map an input vector b to some output vector f(b) = x that happens to satisfy Ax = b.
If the matrix A has sufficiently nice properties [2], we can use an efficient iterative method such as the conjugate gradient method to map the input vector b to an output vector x by solving Ax = b. To run the conjugate gradient method, we don't necessarily need an explicit dense or sparse representation of the matrix A; we merely need a function that encodes how A acts on a vector by matrix multiplication, e.g. some function g that maps an input vector x to the output vector g(x) = Ax.
So if we've got some appropriately defined conjugate gradient algorithm cg(g, b), we can define the function f(b) := cg(g, b).

Calling f(b) will return an x that satisfies g(x) = b, i.e. f is the inverse of g. So now we're expressing linear algebra without any explicit matrices anywhere: but our functions g and f are of course both linear operators that encode the action of matrix multiplication by A and A^{-1} respectively.
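Here is a minimal matrix-free sketch of the whole idea in Python, using scipy's LinearOperator and cg (my choice of tooling, not something the above prescribes). The operator is a 1D Laplacian-style tridiag(-1, 2, -1), which is symmetric positive definite, and A is never formed:

```python
# Matrix-free conjugate gradient: g applies A = tridiag(-1, 2, -1) (a "local"
# 1D Laplacian-style operator) without ever building A; cg only needs g.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1000

def g(x):
    """Return Ax for A = tridiag(-1, 2, -1), computed entrywise."""
    y = 2.0 * x
    y[:-1] -= x[1:]
    y[1:] -= x[:-1]
    return y

A_op = LinearOperator((n, n), matvec=g, dtype=np.float64)

def f(b):
    """f(b) := cg(g, b), i.e. the action of A^{-1} on b, computed iteratively."""
    x, info = cg(A_op, b)  # info == 0 on convergence
    return x

b = np.random.default_rng(0).standard_normal(n)
x = f(b)
print(np.linalg.norm(g(x) - b))  # small residual: g(f(b)) ≈ b
```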
[1] Why not? It's a lot more effort to compute A^{-1}, which could then be used to evaluate A^{-1}b for any given b, but often we only want to know A^{-1}b for one specific b. Also, if the matrix A encodes some local property of a physical system, then A^{-1} will encode a global property, so it will generally need to be a dense matrix.

[2] Sufficiently nice properties: e.g. if A is symmetric positive definite, which arises naturally in a bunch of situations. See e.g. https://en.m.wikipedia.org/wiki/Conjugate_gradient_method
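To make the last point of [1] concrete, here's a tiny numpy check (my own illustration): a sparse, local tridiagonal operator whose inverse is completely dense.

```python
# A sparse "local" operator (tridiagonal: each row couples only its neighbours)
# whose inverse couples every entry to every other entry, i.e. is fully dense.
import numpy as np

n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # tridiag(-1, 2, -1)
A_inv = np.linalg.inv(A)

print(np.count_nonzero(A))      # 3n - 2 = 22 nonzeros: sparse
print(np.count_nonzero(A_inv))  # n^2 = 64 nonzeros: dense
```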
For the more machine learning-minded folks, this comes up almost everywhere in inference with exact Gaussian processes (GPs): because of the non-parametric nature of a GP model, the covariance matrix grows with the number of data points, and the exact inference routine is cubic in that number. Hence, sparsity in the posited covariance matrix is _extremely_ important for fast inference.
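As a rough numpy sketch of where that cubic cost comes from (my own illustration with an RBF kernel, not anything specific to the comment above): the exact GP posterior mean requires a factorisation and solve against the full n x n covariance matrix.

```python
# Exact GP posterior mean: build the n x n covariance (O(n^2) memory) and
# factor it (O(n^3) time). This is the step that motivates sparsity/approximation.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=n)                 # training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(n)   # noisy training targets
X_star = np.linspace(0, 10, 50)                # test inputs

K = rbf_kernel(X, X) + 1e-2 * np.eye(n)        # n x n covariance + noise
L = np.linalg.cholesky(K)                      # O(n^3) factorisation
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean_star = rbf_kernel(X_star, X) @ alpha      # posterior mean at test points
print(mean_star[:5])
```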
And what happens if the huge matrix's elements are indeed all non-zero? Like, let's say, satellite scan data for a given country when you want to spy on their underground systems (think North Korea facilities)? Wouldn't storing that data as COO actually triple the amount of memory?
Presumably the designer(s) of such a system will know in advance whether the resulting matrix will be sparse or not and choose their encoding appropriately. FWIW, for a lot of practical applications the raw sensor data would be non-sparse, but it would be transformed/filtered almost immediately into a more space-efficient representation. For example, in the case you mentioned (unless you think the entire country is completely riddled with tunnels), most of the raw data can be deleted immediately with no loss of signal, since there is no underground system underneath that particular spot of land.
That's my point. In order to analyze it, you need to preserve the entire data set. One system does the image acquisition; another, more advanced system does the analysis. The advanced system needs all the data, so discarding it at the entry point defeats its purpose. No matter how you flip it, there are practical applications where huge matrices with mostly non-zero elements exist.
Here is another example: cryptanalysis. Usually the password length is small, below 1k characters. But if you increase the key length, then simple old ciphers, like the Vigenère cipher, become harder to crack. Bump the key length to over 1 million bytes and suddenly your century-old methods, like the index of coincidence and Kasiski examination, become a problem of feeding huge matrices to the analyzer. A key length of 1 million suddenly generates a 256M x 256M matrix for the analyzer, and all elements in that matrix are non-zero.