Search configuration
See "Code search overview" for general information about Sourcegraph's code search.
General
Maximum file size
By default, files larger than 1 MB are excluded from search results. Use the `search.largeFiles` keyword to specify files to be indexed and searched regardless of size. Regardless of how you configure the `search.largeFiles` site configuration setting, Sourcegraph will continue to ignore binary files, even if their size is below the limit you set.
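For example, a site configuration entry along these lines (the glob patterns are illustrative) would index and search matching files regardless of their size:

```json
"search.largeFiles": [
  "**/*.bundle.js",
  "data/fixtures/*.sql"
]
```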
Maximum timeout
By default, users cannot set a search timeout larger than 1 minute. Sourcegraph admins can increase the maximum timeout through site configuration:
JSON"search.limits": { "maxTimeoutSeconds": 120, },
Exclude files and directories
You can exclude files and directories from search by adding the file `.sourcegraph/ignore` to the root directory of your repository. Sourcegraph interprets each line in the ignore file as a glob pattern. Files or directories matching those patterns will not show up in the search results.
The `ignore` file is tied to a commit. This means that if you committed an `ignore` file to a feature branch but not to your default branch, then only search results for the feature branch will be filtered, while the default branch will show all results.
For example:
```bash
// .sourcegraph/ignore
// lines starting with "//" are comments and are ignored
// empty lines are ignored, too

// ignore the directory node_modules/
node_modules/

// ignore the directory src/data/
src/data/

// ** matches all characters, while * matches all characters except /
// ignore all JSON files
**.json

// ignore all JSON files at the root of the repository
*.json

// ignore all JSON files within the directory data/
data/**.json

// ignore all data folders
data/
**/data/

// ignore all files that start with numbers
[0-9]*.*
**/[0-9]*.*
```
Our syntax closely follows what is documented in the Linux Documentation Project. However, we distinguish between `*` and `**`: while `**` matches all characters, `*` matches all characters except the path separator `/`.
Note that invalid globbing patterns will cause an error, and searches over commits containing a broken `ignore` file will not return any results.
Indexed search
Sourcegraph indexes the code on the default branch of each repository. This speeds up searches that hit many repositories at once. Not all files on a branch are indexed: we skip files larger than 1 MB and binary files. To view which files were skipped during indexing, visit the repository settings page and click on Indexing.
For large deployments we recommend horizontally scaling indexed search. You can do this by adjusting the number of replicas. Sourcegraph shards repository indexes across replicas. When the replica count changes, Sourcegraph will slowly rebalance indexes to ensure availability of existing indexes.
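How you change the replica count depends on your deployment method. As a sketch, in a Kubernetes deployment it might look like this (the StatefulSet name `indexed-search` may differ in your manifests):

```bash
# Scale indexed search to 4 replicas; Zoekt will gradually
# rebalance shards across the new replica set.
kubectl scale statefulset indexed-search --replicas=4
```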
The resource requirements for indexed search vary considerably based on the text contents of your repositories, but a good estimate is that the node should have enough memory to hold the entire text contents of the default branch of each repository.
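As a rough, illustrative sizing exercise (all numbers hypothetical):

```text
1,000 repositories x ~200 MB of text on each default branch ≈ 200 GB of text
=> e.g., 4 indexed-search replicas with ~50 GB of memory each, plus headroom for growth
```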
Multi-branch indexing
Sourcegraph always maintains an index of the source code on your default branch. Some organizations have other branches that are regularly searched. To speed up search for those branches, Sourcegraph can be configured to index up to 64 branches per repository.
You can configure indexed branches in site configuration under the `experimentalFeatures.search.index.branches` setting. For example:
JSON"experimentalFeatures": { "search.index.branches": { "github.com/sourcegraph/sourcegraph": ["3.15", "develop"], "github.com/sourcegraph/src-cli": ["next"] } }
Indexing multiple branches will increase Sourcegraph's resource requirements (particularly memory). The indexer deduplicates documents between branches, so the size of your index grows in relation to the number of unique documents. Refer to our resource estimator to estimate whether additional resources are required.
Scaling considerations
Zoekt is Sourcegraph's indexing engine. When processing a repository, Zoekt splits the index it creates across one or more files on disk. These files are called "shards."
Zoekt uses memory maps to load all shards into memory to evaluate search queries. In most deployments, the total size of the search index (all the shards on disk) is much larger than the total amount of RAM available to Zoekt. Amongst other benefits, using memory maps allows Zoekt to:
- Leverage demand paging: a portion of a shard file is read from disk only when Zoekt accesses it (and it isn't already in RAM)
- Leverage the kernel's page cache: keep the most frequently accessed pages in RAM and evict them when the system is under memory pressure
Consideration: RAM versus Disk
As stated above, Zoekt uses RAM as a cache for accessing shards while evaluating a search query.
- The more RAM you give Zoekt, the more shards it can hold in RAM before it has to access the disk.
- The less RAM you give Zoekt, the more often it has to read shard data from disk. More frequent disk access degrades search performance and increases disk utilization.
Tuning Zoekt's resource requirements is a balance between:
- The amount of RAM you are willing to allocate
- The amount of disk i/o resources that you have available in your environment
- The impact on search performance that you find acceptable
Consideration: Available memory maps
As stated above, Zoekt uses memory maps to load all of its shards into memory.
There is a limit to the number of memory maps a process can create on Linux. On most systems, the default limit is 65536 maps. Processes, including Zoekt, will be terminated if they attempt to allocate more memory maps than that limit.
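On Linux this limit is typically governed by the `vm.max_map_count` kernel parameter. A sketch of how you might inspect and raise it on a host running Zoekt (the new value is only an example; how you persist the setting depends on your environment):

```bash
# Check the current per-process memory map limit
sysctl vm.max_map_count

# Raise the limit for the running kernel (example value)
sudo sysctl -w vm.max_map_count=131072
```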
This limit provides a ceiling for the number of shards a Zoekt instance can store (each shard consumes at least one memory map, and a repository typically corresponds to at least one shard). When capacity planning, you can estimate the number of Zoekt instances you'll need, including some slack for growth, via the following rule of thumb:
```text
(number of repositories) / (60% * (memory map limit))
```
So, assuming that you have 300,000 repositories and a memory map limit of 65,536, this results in:
```text
(300,000) / (.6 * 65,536) = 7.62 => 8 instances
```
Sourcegraph's monitoring system also includes an alert for this scenario and mitigation steps.
Shard merging
Shard merging is a feature of Zoekt which combines smaller index files, or shards, into one larger file, a compound shard.
This reduces memory requirements for Zoekt webserver and improves performance of searches that are bound by "work done per shard".
This feature is particularly useful for customers with many small and rarely updated repositories.
Shard merging is on by default. It can be disabled by setting the environment variable `SRC_DISABLE_SHARD_MERGING="true"` for Zoekt indexserver.
Shard merging can be fine-tuned by setting environment variables for Zoekt indexserver:
| Env Variable | Description | Default |
|---|---|---|
| `SRC_VACUUM_INTERVAL` | Run vacuum this often, specified as a duration | 24 hours |
| `SRC_MERGE_INTERVAL` | Run merge this often, specified as a duration | 8 hours |
| `SRC_MERGE_TARGET_SIZE` | The target size of compound shards in MiB | 2000 |
| `SRC_MERGE_MIN_SIZE` | The minimum size of a compound shard in MiB | 1800 |
| `SRC_MERGE_MIN_AGE` | The time since the last commit in days. Shards with newer commits are excluded from merging. | 7 |
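For example, shard merging might be tuned by exporting variables like these in the Zoekt indexserver's environment (the values are illustrative only, and we assume Go-style duration strings such as `8h` for the interval settings):

```bash
# Illustrative values only; the defaults are usually sensible.
export SRC_MERGE_INTERVAL=4h        # merge more often than the 8h default
export SRC_MERGE_TARGET_SIZE=4000   # aim for larger compound shards (MiB)
export SRC_MERGE_MIN_SIZE=3600      # dismantle compound shards below this size (MiB)
export SRC_MERGE_MIN_AGE=14         # only merge shards whose last commit is older (days)
```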
When repositories receive updates, Zoekt reindexes them and tombstones their old index data. As a result, compound shards can shrink and are dismantled into individual shards once they fall below a critical minimum size. These individual shards are then considered for future merge operations. Shard merging can be monitored via the "Compound shards" panel in Zoekt's Grafana dashboard.