Below is a non-exhaustive but representative set of prior art references reasonably pertinent to claims directed to sparsifying Transformer self-attention (i.e., constraining, masking, selecting, or otherwise reducing pairwise attention computations to achieve sub-quadratic complexity or sparsity in the attention pattern). Each entry identifies the earliest publicly available date known with reasonable certainty and explains the material teachings relevant to novelty and obviousness analyses under 35 U.S.C. §§ 102 and 103. Where appropriate, I indicate how a person of ordinary skill in the art (POSITA) would have been motivated to combine or adapt these teachings.
A. Foundational structured-sparse attention patterns (fixed, local, dilated, global)
1) Child, Gray, Radford, Sutskever, “Generating Long Sequences with Sparse Transformers,” arXiv:1904.10509 (first posted Apr. 24, 2019).
- Material teaching: Introduces block-sparse self-attention with predefined local, strided/dilated, and fixed sparse patterns enabling O(n√n) or related sub-quadratic behavior. Shows that structured masks preserve long-range dependencies while avoiding full O(n²) cost.
- Relevance: Anticipates or renders obvious claim elements reciting block-sparse attention, strided/dilated patterns, mask-based sparsification, or hybrid local/global patterns implemented at the head- or layer-level. Provides enabling detail on mask construction and training/inference behavior (an illustrative mask-construction sketch appears at the end of this section).
2) Beltagy, Peters, Cohan, “Longformer: The Long-Document Transformer,” arXiv:2004.05150 (first posted Apr. 10, 2020).
- Material teaching: Sliding-window attention plus a limited set of global tokens with full connectivity; optionally dilated windows. Achieves linear-time attention with task-specific global tokens (e.g., CLS).
- Relevance: Anticipates or renders obvious claims directed to windowed attention, dilated windows, and “global” token subsets with unrestricted attention while the remaining tokens are restricted to local windows.
3) Ainslie et al., “ETC: Encoding Long and Structured Inputs in Transformers,” arXiv:2004.08483 (first posted Apr. 18, 2020).
- Material teaching: Global-local attention with two token types (global and long), sparse connectivity between them, and restricted long-to-long attention (e.g., within segments). Introduces structured sparse patterns for long documents.
- Relevance: Anticipates claims reciting multi-partition token sets with asymmetric sparse connectivity or hierarchical/global “hub” tokens, and supports obviousness combinations with other windowed or block patterns.
4) Zaheer et al., “Big Bird: Transformers for Longer Sequences,” arXiv:2007.14062 (first posted Jul. 28, 2020).
- Material teaching: Combines three sparse components—random, sliding window, and global attention—to approximate dense attention with strong theoretical guarantees (e.g., universal approximation, Turing completeness). Empirically effective on long sequences.
- Relevance: Anticipates claims covering hybrid sparse masks that include any mix of windowed, random, and global edges; further supports §103 arguments that adding random or global edges to local windows would have been an obvious robustness enhancement.
5) Parmar et al., “Image Transformer,” arXiv:1802.05751 (first posted Feb. 16, 2018).
- Material teaching: Local (block) attention and restricted receptive fields for 2D/1D sequences to reduce cost in visual and sequence domains.
- Relevance: Early disclosure of locality-constrained attention patterns. Suggests obviousness of sliding-window or block-local attention for other modalities (e.g., text).
6) Ho, Kalchbrenner, Weissenborn, Salimans, “Axial Attention in Multidimensional Transformers,” arXiv:1912.12180 (first posted Dec. 23, 2019).
- Material teaching: Factorizes full attention along axes (rows/columns), reducing complexity by decomposing global attention into multiple sparse passes.
- Relevance: Anticipates claim elements that factorize attention into sparse sub-operations achieving sub-quadratic complexity; provides alternatives to block/window masks.
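For the reader's convenience, the following minimal sketch illustrates how the fixed sparse patterns discussed in this section (sliding-window, strided/dilated, global-token, and random edges) can be composed into a single boolean attention mask. It is an illustration under stated assumptions: it assumes NumPy, all function and parameter names are hypothetical rather than drawn from any cited implementation, and the dense-mask formulation shown here demonstrates connectivity only (practical speedups require block-sparse kernels).

```python
import numpy as np

def build_sparse_mask(n, window=2, stride=4, n_global=1, n_random=2, seed=0):
    """Boolean n x n mask; True marks an allowed query-key pair (illustrative only)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    local = dist <= window                    # sliding-window edges (Longformer/ETC-style)
    strided = (dist % stride) == 0            # symmetric strided/dilated edges (Sparse Transformer-style);
                                              # with stride ~ sqrt(n), roughly O(n*sqrt(n)) pairs
    global_ = np.zeros((n, n), dtype=bool)
    global_[:n_global, :] = True              # a few global tokens attend everywhere...
    global_[:, :n_global] = True              # ...and are attended to by every token
    random_ = np.zeros((n, n), dtype=bool)
    for i in range(n):                        # a handful of random edges per query (BigBird-style)
        random_[i, rng.choice(n, size=n_random, replace=False)] = True
    return local | strided | global_ | random_

def masked_attention(Q, K, V, mask):
    """Softmax attention restricted to allowed pairs; disallowed pairs receive zero weight."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

mask = build_sparse_mask(n=16)
print(int(mask.sum()), "allowed pairs out of", mask.size)  # far fewer than n**2 as n grows
```

A causal (decoder-side) variant would additionally intersect this mask with a lower-triangular constraint, consistent with the autoregressive setting of Child et al.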
B. Content- or data-dependent sparsification (hashing, clustering, top-k)
7) Kitaev, Kaiser, Levskaya, “Reformer: The Efficient Transformer,” arXiv:2001.04451 (first posted Jan. 13, 2020).
- Material teaching: Locality-sensitive hashing (LSH) attention that routes tokens to buckets so attention is computed within buckets, yielding sub-quadratic complexity. Also introduces chunking and reversible layers.
- Relevance: Anticipates claims reciting LSH-based routing or content-dependent grouping to sparsify attention. Provides enabling details for hash construction, bucket attention, and complexity.
8) Roy, Saffar, Vaswani, Grangier, “Efficient Content-Based Sparse Attention with Routing Transformers,” arXiv:2003.05997 (first posted Mar. 13, 2020).
- Material teaching: Online k-means-like clustering of token representations to route attention within clusters (sparse), improving efficiency and quality.
- Relevance: Anticipates claims involving clustering-based routing, nearest-neighbor grouping, or dynamic partitions that bound attention neighborhood size per query.
9) Zhou et al., “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting,” arXiv:2012.07436 (first posted Dec. 14, 2020).
- Material teaching: ProbSparse self-attention that selects “dominant” queries based on score distributions (e.g., via KL divergence proxies), computing attention only for selected queries.
- Relevance: Anticipates claims to score-driven or probabilistic top-k selection of attention computations, query pruning, or sparsified score matrices based on distributional criteria (a simplified query-selection sketch follows below).
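To illustrate the score-driven selection idea of item 9, the simplified sketch below computes attention only for the queries whose score rows are most “peaked” and substitutes the mean of the values for the remaining queries. This is a hedged simplification under stated assumptions: it assumes NumPy, uses hypothetical names, computes the full score matrix for clarity (Informer instead estimates its measurement on a sampled subset of keys to stay sub-quadratic), and omits the causal/cumulative handling used in the paper.

```python
import numpy as np

def topk_query_attention(Q, K, V, k):
    """Attend only for the k queries with the most peaked score distributions (simplified)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n_q, n_k); full matrix for clarity only
    peakedness = scores.max(axis=1) - scores.mean(axis=1)   # simple proxy for a dominance measurement
    top = np.argsort(-peakedness)[:k]                       # indices of the selected "dominant" queries
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))          # non-selected queries receive mean(V)
    sel = scores[top]
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V                                        # exact softmax attention for selected queries
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
print(topk_query_attention(Q, K, V, k=5).shape)             # (32, 8)
```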
C. Random or hybrid sparsity and theoretical guarantees
10) BigBird (Zaheer et al., 2020) as above.
- Material teaching: Random edges + local windows + global anchors; theoretical proofs that such sparse graphs approximate dense attention properties.
- Relevance: Strong reference against novelty for claims that combine random, local, and global connectivity with asserted theoretical capacity benefits.
11) Longformer (Beltagy et al., 2020) and ETC (Ainslie et al., 2020) as above.
- Relevance: Together with BigBird, these references provide a comprehensive set of hybrid sparse patterns pre-dating many later applications.
D. Low-rank/kernel approximations as closely related alternatives (obviousness rationales)
12) Wang et al., “Linformer: Self-Attention with Linear Complexity,” arXiv:2006.04768 (first posted Jun. 8, 2020).
- Material teaching: Projects keys and values along the sequence dimension to a fixed, smaller size, yielding a low-rank approximation of the attention matrix and linear complexity in sequence length.
- Relevance: While not “sparse masks,” teaches the same problem (sub-quadratic attention) and would motivate a POSITA to consider structured sparsity as an alternative known path to the same efficiency goal.
13) Choromanski et al., “Rethinking Attention with Performers,” arXiv:2009.14794 (first posted Sep. 30, 2020).
- Material teaching: Kernel-based approximation of softmax attention via random feature maps (FAVOR+), yielding linear time and memory complexity.
- Relevance: Supports §103 combinations showing that, by 2020, multiple families of solutions (sparse masks, hashing/clustering, low-rank, and kernelization) were well known; choosing a sparse-mask embodiment would have been one of a finite set of predictable options to reduce cost.
14) Katharopoulos et al., “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention,” arXiv:2006.16236 (first posted Jun. 29, 2020).
- Material teaching: Linear attention via kernel trick; O(n) complexity.
- Relevance: Same as above: an alternative efficiency pathway indicative of the state of the art and of the motivation to avoid dense O(n²) operations (a minimal linear-attention sketch follows below).
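As a concrete instance of the kernel/linear-attention family discussed in this section, the sketch below uses the elu(x)+1 feature map described by Katharopoulos et al. to compute attention in O(n) time and memory. It is a minimal, non-causal sketch assuming NumPy; names are illustrative, and Performer's FAVOR+ would instead use random feature maps to approximate the softmax kernel.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(n) attention via a positive feature map phi(x) = elu(x) + 1 (non-causal sketch)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, elementwise positive map
    Qf, Kf = phi(Q), phi(K)                              # (n, d) feature-mapped queries and keys
    KV = Kf.T @ V                                        # (d, d_v): keys/values aggregated once
    Z = Kf.sum(axis=0)                                   # (d,): per-feature normalizer terms
    return (Qf @ KV) / ((Qf @ Z)[:, None] + eps)         # (n, d_v), never forming the n x n matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
print(linear_attention(Q, K, V).shape)                   # (64, 16)
```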
E. Earlier block/local or structured attention and sparsity-inducing distributions
15) Shen et al., “Bi-BloSAN: Bidirectional Block Self-Attention Network for Fast Text Classification,” arXiv:1804.07094 (first posted Apr. 19, 2018).
- Material teaching: Block-wise restricted self-attention to reduce complexity for sequences while preserving bidirectional context through hierarchical blocks.
- Relevance: Contemporaneous structured sparse self-attention outside the canonical Transformer stack; anticipates block-based sparsity rationales and hierarchical masking.
16) Martins, Astudillo, “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification,” arXiv:1602.02068 (first posted Feb. 6, 2016).
- Material teaching: Sparsemax yields exactly sparse attention distributions (zeroing some weights) as a drop-in replacement for softmax.
- Relevance: For claims reciting sparsity in the attention weight vector itself (as opposed to a hard mask on the score matrix), this anticipates using alternative normalizers to induce zeros. It also supports obviousness for combining mask-based sparsity with sparse normalizers to further reduce compute (a minimal sparsemax sketch follows at the end of this section).
17) Lee et al., “Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks,” arXiv:1810.00825 (first posted Oct. 1, 2018).
- Material teaching: Inducing point attention (ISAB) reduces quadratic cost by attending via a small set of learned inducing points, yielding structured sparsity of interactions.
- Relevance: Anticipates use of a small, learned “global” or “inducing” token set to mediate attention—closely related to global tokens in Longformer/ETC.
18) Dai et al., “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,” arXiv:1901.02860 (first posted Jan. 9, 2019).
- Material teaching: Segment-level recurrence and relative positions enabling long-context modeling with reduced recomputation.
- Relevance: Not sparse per se, but evidences the field’s recognized need to scale Transformers to longer sequences, motivating sparse designs.
19) Xiong et al., “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention,” arXiv:2102.03902 (first posted Feb. 7, 2021).
- Material teaching: Low-rank Nyström approximation to attention.
- Relevance: Alternate sub-quadratic approximation supporting obviousness rationales and combinations with sparse/global token concepts.
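Because item 16 concerns sparsity in the normalized attention distribution rather than in a hard mask, the following minimal sketch shows the closed-form sparsemax projection of a score vector onto the probability simplex, which drives low-scoring entries exactly to zero. It assumes NumPy and follows the standard published algorithm; names are illustrative.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex (exactly sparse output)."""
    z_sorted = np.sort(z)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(z_sorted)                  # cumulative sums of sorted scores
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cssv           # coordinates that remain in the support
    k_z = k[support][-1]                        # support size
    tau = (cssv[k_z - 1] - 1.0) / k_z           # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.0, 0.1])))     # [1. 0. 0.]: some weights are exactly zero
```

Unlike softmax, which assigns nonzero weight to every key, this projection zeroes a subset of attention weights exactly, which is the property relied upon in claims directed to sparse attention distributions.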
F. Survey and state-of-the-art syntheses
20) Tay, Dehghani, Bahri, Metzler, “Efficient Transformers: A Survey,” arXiv:2009.06732 (first posted Sep. 14, 2020).
- Material teaching: Systematic taxonomy of efficient attention mechanisms including structured sparsity (local, strided, global), hashing/clustering, low-rank and kernel methods, with citations to earlier works.
- Relevance: Demonstrates that, by late 2020, sparse attention patterns and their design space were well known. Useful to establish motivation to combine known sparse patterns and to rebut assertions of non-obviousness premised on “unexpected” efficiency/accuracy trade-offs.
G. Additional domain-specific sparse attention (contextual corroboration)
21) Huang et al., “CCNet: Criss-Cross Attention for Semantic Segmentation,” arXiv:1811.11721 (first posted Nov. 28, 2018).
- Material teaching: Row-and-column attention producing sparse connectivity patterns with broad receptive field at reduced cost.
- Relevance: Shows that structured sparsity in attention to enlarge receptive fields at sub-quadratic cost was known in vision; supports obviousness of applying analogous patterns to text (a minimal criss-cross mask sketch follows below).
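To make the criss-cross connectivity of item 21 concrete, the sketch below builds the row-and-column mask for an H x W feature map flattened to length H*W; each position may attend only to positions in its own row or column, giving O(H*W*(H+W)) allowed pairs instead of O((H*W)²). It assumes NumPy and uses illustrative names; CCNet itself applies this pattern recurrently so that two passes cover the full image.

```python
import numpy as np

def criss_cross_mask(H, W):
    """(H*W) x (H*W) boolean mask: each flattened position attends to its own row and column."""
    rows = np.repeat(np.arange(H), W)              # row index of each flattened position
    cols = np.tile(np.arange(W), H)                # column index of each flattened position
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    return same_row | same_col

mask = criss_cross_mask(4, 5)
print(int(mask.sum()), "allowed pairs out of", mask.size)  # 160 out of 400 for a 4 x 5 grid
```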
Legal analysis considerations
- Anticipation (§102): Many of the above references expressly disclose fixed sparse masks (e.g., local/sliding windows, dilated/strided patterns), hybrid masks with global tokens, and random edges (Child 2019; Longformer 2020; ETC 2020; BigBird 2020). If an asserted claim reads on implementing attention through any predetermined subset of token pairs (with local windows and one or more global tokens) to achieve sub-quadratic complexity, Longformer and BigBird are particularly strong anticipatory art. Child (2019) is often the earliest widely cited disclosure of block-sparse Transformer attention.
- Obviousness (§103): Even where a claim adds (a) random edges to local windows, (b) a small set of global tokens, or (c) dilations/strides, the combination is taught or suggested across Child (2019), Longformer (2020), ETC (2020), and BigBird (2020). Content-based routing variants are taught by Reformer (LSH) and Routing Transformers (clustering). Score-based query pruning is taught by Informer. Given the well-documented goal of reducing O(n²) complexity, a POSITA would have had strong motivation to adopt any of these sparse patterns, with a reasonable expectation of success, as corroborated by Tay et al. (2020) and the widespread open-source implementations that followed.
- Enablement and written description (§112): The cited works provide sufficient implementation details for mask construction, complexity analysis, and training dynamics (e.g., handling global tokens, maintaining gradient flow across sparse patterns, batching). They are thus probative prior art for enablement and written description.
Targeted search guidance
- CPC/IPC classes: G06N 3/04 (neural network architectures) and G06N 3/08 (neural network learning methods), together with G06N 20/00 (machine learning); cross-reference G06F 17/16 (matrix or vector computation) and G06F 40/00 (handling natural language data) where attention mechanisms are claimed in NLP contexts. In practice, recent AI filings often concentrate in G06N 3/00 subclasses.
- Keywords/strings: “sparse attention,” “block-sparse attention,” “sliding window attention,” “dilated attention,” “global tokens,” “random attention edges,” “LSH attention,” “clustered attention,” “routing transformer,” “probabilistic attention,” “top-k attention,” “inducing points attention,” “Nyström attention,” “linear attention.”
- Likely assignees/authors: Google/DeepMind and Google Research/Brain (Image Transformer, Reformer, Routing Transformers, ETC, BigBird, Performer); Allen Institute for AI (Longformer); Facebook/Meta AI (Linformer); OpenAI (Sparse Transformers); and the academic groups associated with the remaining arXiv works (e.g., Informer, Nyströmformer, CCNet, Bi-BloSAN).
Use and limitations
- The precise relevance of each item will depend on the priority date and the specific claim language (e.g., whether the claim is limited to fixed masks vs. content-dependent selection; whether “sparsity” is in the mask or in the normalized attention distribution; whether particular complexity bounds, e.g., O(n), O(n log n), are recited).
- If you provide the asserted claims, I can prepare a claim chart mapping these references to specific limitations to support §§102/103 rejections.
In view of the above, the identified non-patent literature provides robust, enabling disclosures of Transformer attention sparsification via fixed masks, hybrid local/global/random connectivity, hashing/clustering-based routing, and score-driven query selection well before mid-2021, and would constitute strong prior art against later-filed claims to such techniques.