Many web sites consist of thousands, tens of thousands, or even a few hundred thousand links. The initial approach to scanning a 300,000-link site might be to give the scanner a starting URI, launch a scan, then sit back while the scanner attempts to navigate the entire site. This approach isn't wrong, but it is inefficient and often unnecessary.
There's good motivation for scanning every link of a web site: coverage. Coverage refers to the amount of the web application exercised by the scanner. It's an important factor in security testing because, in simple terms, a link that isn't crawled is a link that isn't tested. However, the real goal isn't coverage of links, but coverage of functionality. The site's functionality is responsible for accepting user input and reacting to it in some manner. This is where vulnerabilities occur.
Scanning 300,000+ links is rarely practical in terms of either scanner resources or time. Many of the links could be redundant. For example, there could be 10,000 links that point to a database record like /document.cgi?id=1 through /document.cgi?id=10000. In such a case there's no need to go through all 10,000 to test the page's functionality with regard to XSS and SQL injection; it's only necessary to test the id parameter on one representative link.
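One way to spot this kind of redundancy is to reduce each link to a signature made of its path and its parameter names, ignoring the values. The following is a minimal sketch, not a feature of any particular scanner; the function names link_signature and dedupe_links are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def link_signature(uri):
    """Return (path, sorted parameter names), ignoring parameter values."""
    parts = urlsplit(uri)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return parts.path, tuple(sorted(names))


def dedupe_links(uris):
    """Keep the first URI seen for each signature (hypothetical helper)."""
    representatives = {}
    for uri in uris:
        representatives.setdefault(link_signature(uri), uri)
    return list(representatives.values())


# The 10,000 document links from the example collapse to one representative.
links = [f"/document.cgi?id={n}" for n in range(1, 10001)]
to_scan = dedupe_links(links)
```

Here to_scan holds a single link, /document.cgi?id=1, which is enough to exercise the id parameter.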
There are counter-arguments to the id example. Many site design patterns use parameter values to select different code paths. For example, /page.cgi?action=viewProfile and /page.cgi?action=editUser involve (as the parameter values imply) two different sets of functionality. Consequently, the scanner shouldn't blindly throw away links that look redundant just because the page and parameter names are identical.
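A signature-based comparison can account for this by keeping the value of any parameter known to select a code path. The sketch below assumes the tester supplies that list by hand; the set CODE_PATH_PARAMS and the function functional_signature are hypothetical, and the parameter names in the set are only examples.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: parameter names that select a code path on the target site.
# This list has to come from knowledge of the site; it is only an example.
CODE_PATH_PARAMS = {"action", "cmd", "mode"}


def functional_signature(uri, code_path_params=CODE_PATH_PARAMS):
    """Return (path, parameters), keeping values only for code-path parameters."""
    parts = urlsplit(uri)
    keys = set()
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        # Keep the value when it changes functionality; otherwise drop it.
        keys.add((name, value if name in code_path_params else ""))
    return parts.path, tuple(sorted(keys))


same_function = (functional_signature("/document.cgi?id=1")
                 == functional_signature("/document.cgi?id=2"))
different_function = (functional_signature("/page.cgi?action=viewProfile")
                      != functional_signature("/page.cgi?action=editUser"))
```

With this signature the two document links still count as one piece of functionality, while the viewProfile and editUser links are treated as distinct.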
Ideally, a scanner should cover the target site's full functionality. In practice, measuring coverage solely from a black box perspective is hard. Solutions can take a brute force approach (scan every link) or an iterative approach that slowly narrows the scanner's focus. Some possible methodologies are:
- Divide the web site into logical areas based on directory structure. Scan those areas separately.
- Sort the list of crawled URIs. Look for redundant links. Create scan rules (e.g. black lists) that avoid those links.
- Determine high-risk areas of the web site. Create scan rules (e.g. white lists) that instruct the scanner to only visit those links.
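The methodologies above can be roughed out in a few lines. This is an illustrative sketch only: real scanners have their own rule syntax, and the helpers group_by_directory and allowed, the sample URIs, and the regular expressions are all assumptions made for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_directory(uris):
    """Sort the crawled URIs and bucket them by top-level directory."""
    groups = defaultdict(list)
    for uri in sorted(uris):
        segments = urlsplit(uri).path.split("/")
        # "/admin/users.cgi" -> ["", "admin", "users.cgi"]; files in the root go to "/".
        top = "/" + segments[1] + "/" if len(segments) > 2 else "/"
        groups[top].append(uri)
    return dict(groups)


def allowed(uri, whitelist=(), blacklist=()):
    """Apply scan rules: black lists exclude, a non-empty white list restricts."""
    if any(re.search(pattern, uri) for pattern in blacklist):
        return False
    return not whitelist or any(re.search(pattern, uri) for pattern in whitelist)


crawled = [
    "/index.html",
    "/docs/document.cgi?id=1",
    "/docs/document.cgi?id=2",
    "/admin/users.cgi?action=editUser",
]
areas = group_by_directory(crawled)

# Black list: skip every document link except id=1.
skip_redundant = [r"^/docs/document\.cgi\?id=(?!1$)\d+$"]
# White list: visit only an area judged to be high risk.
high_risk_only = [r"^/admin/"]

first_pass = [u for u in crawled if allowed(u, blacklist=skip_redundant)]
focused_pass = [u for u in crawled if allowed(u, whitelist=high_risk_only)]
```

Each key in areas is a candidate for a separate scan, first_pass drops the redundant document link, and focused_pass narrows the scan to the /admin/ area.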