How can robots.txt disallow all URLs except URLs that are in sitemap

This isn't strictly a robots.txt answer; it relates to the Robots protocol as a whole. I have used this technique extremely often in the past, and it works like a charm.

As far as I understand, your site is dynamic, so why not make use of the robots meta tag? As x0n said, a 30MB file will likely create issues for both you and the crawlers, and appending new lines to a 30MB file is an I/O headache. Your best bet, in my opinion anyway, is to inject something like this into the pages you don't want indexed:
<META NAME="ROBOTS" CONTENT="NOINDEX" />
The page will still be crawled, but it won't be indexed. Just make sure those pages are not also disallowed in robots.txt, since a crawler has to fetch a page to see the tag. You can still submit the sitemaps through a sitemap reference in robots.txt, and you don't have to take care to keep pages that are robotted out with a meta tag from appearing in the sitemaps. The tag is supported by all the major search engines and, as far as I remember, by Baidu as well.
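
Since the site is dynamic, the decision can be made at render time. Below is a minimal sketch in Python, assuming the set of sitemap paths is available to the page template; the names `SITEMAP_PATHS`, `robots_meta_tag` and `render_head` are made up for illustration, as are the example paths, and your framework will have its own way of building the `<head>`.

```python
from html import escape
from urllib.parse import urlsplit

# Hypothetical: paths that appear in the sitemap. In a real site this would
# come from the same data source that generates the sitemap files.
SITEMAP_PATHS = {"/", "/about", "/products/42"}


def robots_meta_tag(url, sitemap_paths=SITEMAP_PATHS):
    """Return a noindex meta tag for pages outside the sitemap, else ''."""
    # Compare on the path only, so query strings do not affect the lookup.
    path = urlsplit(url).path or "/"
    if path in sitemap_paths:
        return ""
    return '<meta name="robots" content="noindex">'


def render_head(url, title):
    """Build a bare <head> element, adding the robots tag when needed."""
    parts = ["<title>" + escape(title) + "</title>"]
    tag = robots_meta_tag(url)
    if tag:
        parts.append(tag)
    return "<head>" + "".join(parts) + "</head>"
```

With this in place, `render_head("https://example.com/search?q=shoes", "Search")` includes the noindex tag, while a page whose path is in the set is rendered without it. Matching on the path alone is a design choice for the sketch; if your sitemap lists URLs with query strings, compare against the full URL instead.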
