MEDIUM 6.5

GHSA-23j4-mw76-5v7h

Scrapy allows redirect following in protocols other than HTTP

Details

### Impact

Scrapy followed redirects regardless of the scheme of the target URL, so a redirect could lead to a `data://`, `file://`, `ftp://`, or `s3://` URL, or to a URL with any other scheme defined in the `DOWNLOAD_HANDLERS` setting.

However, HTTP redirects should only work between URLs that use the `http://` or `https://` schemes.
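The rule above can be illustrated with a minimal, standard-library-only sketch. It is not the code from the Scrapy patch; the function name `is_redirect_allowed` and the constant `ALLOWED_REDIRECT_SCHEMES` are hypothetical names chosen for this example.

```python
from urllib.parse import urljoin, urlparse

# Assumption for this sketch: only these two schemes are valid redirect targets.
ALLOWED_REDIRECT_SCHEMES = {"http", "https"}


def is_redirect_allowed(request_url: str, location: str) -> bool:
    """Return True if following `location` from `request_url` stays on HTTP(S)."""
    # Resolve relative Location values (e.g. "/login") against the request URL.
    target = urljoin(request_url, location)
    # urlparse() lowercases the scheme, so "FILE:///..." is also rejected.
    return urlparse(target).scheme in ALLOWED_REDIRECT_SCHEMES
```

For example, `is_redirect_allowed("http://example.com/a", "/b")` is true because the relative target resolves to an `http://` URL, while a `Location` of `file:///etc/passwd` is rejected.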

A malicious actor, given write access to the start requests (e.g. ability to define `start_urls`) of a spider and read access to the spider output, could exploit this vulnerability to:

- Redirect to any local file using the `file://` scheme to read its contents.
- Redirect to an `ftp://` URL of a malicious FTP server to obtain the FTP username and password configured in the spider or project.
- Redirect to any `s3://` URL to read its content using the S3 credentials configured in the spider or project.

For `file://` and `s3://`, how the spider implements its parsing of input data into an output item determines what data would be vulnerable. A spider that always outputs the entire contents of a response would be completely vulnerable, while a spider that extracted only fragments from the response could significantly limit vulnerable data.

### Patches

Upgrade to Scrapy 2.11.2.

### Workarounds

Replace the built-in redirect middlewares (`RedirectMiddleware` and `MetaRefreshMiddleware`) with custom ones that implement the fix from Scrapy 2.11.2, and verify that they work as intended.
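As a rough sketch of how such a replacement could be wired up, a project's `settings.py` might disable the two built-in middlewares and register custom ones at the same positions. The module path `myproject.middlewares` and the class names `HttpOnlyRedirectMiddleware` and `HttpOnlyMetaRefreshMiddleware` are hypothetical: they stand for classes you would write yourself, and they are not part of Scrapy. The order values 580 and 600 are assumed to match the built-in defaults; check them against `DOWNLOADER_MIDDLEWARES_BASE` for your Scrapy version.

```python
# settings.py (sketch). Module path and class names below are hypothetical.
DOWNLOADER_MIDDLEWARES = {
    # Setting a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": None,
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # Custom replacements that refuse non-HTTP(S) redirect targets,
    # registered at the order values assumed for the built-in ones.
    "myproject.middlewares.HttpOnlyMetaRefreshMiddleware": 580,
    "myproject.middlewares.HttpOnlyRedirectMiddleware": 600,
}
```

After making a change like this, it is worth confirming with a test request that a redirect to a `file://` URL is no longer followed.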

### References

This security issue was reported by @mvsantos at https://github.com/scrapy/scrapy/issues/457.

Affected packages

PyPI / scrapy

Introduced in: 0

Fixed in: 2.11.2

Fix: `pip install --upgrade 'scrapy>=2.11.2'`