drupal module robotstxt
http://drupal.org/project/robotstxt (version 6.x-1.2 on Drupal 6.14)
Main function:
Supports site-specific robots.txt files in a multisite setup (one code base, multiple databases).
"You must rename or remove the original root robots.txt file" (freely quoting module introduction). Note that when you do this, the sites for which this module is not enabled will not have a robots.txt file!
Example function:
When you have a multisite setup and need to give search engines specific robots.txt instructions on a per-site basis.
This may become necessary if, on a multilanguage site, users and search engines have been exposed to the wrong "shadow" language paths. Blocking further access can help search engines forget those paths more quickly.
Module location: Other. Uninstall: yes.
Manage:
Well, I have not found instructions. There is no Drupal handbook documentation; there is only a short explanatory introduction on the project/robotstxt main page.
The admin user interface at Site configuration > RobotsTxt shows the module's own full generic robots.txt in an edit window, with "Save configuration" and "Reset to defaults" buttons. There is no more to it: you make changes and hit Save configuration. When you go to another of the full-domain multisites and enable the module there, its robots.txt starts fresh, i.e. it is independent of the first one. That works.
By the way: do not forget to enable the module for all multisite domains, especially since the original robots.txt, which was valid for all of them, is now renamed or deleted. (I almost forgot - actually I disabled and undid the entire thing since it wasn't working for me, see next.)
Issue:
It does not work on a multisite with symlinked subdomains, which are typical of a multilanguage setup by subdomain. I am using multisite language subdomains: domain.com and fr.domain.com, where the subdomain is a symlink. Apparently robots.txt cannot be varied among such (sub)domains: there is no difference when visiting the different sites, presumably because the symlinked subdomain shares the same site directory and database, so both hosts read the same stored robotstxt value.
I really do not like experimenting with the robots.txt file, it is a bit dangerous. Yet I am wondering whether this can be solved by:
a) Making the robots.txt entry a multilingual variable (!). The content of the module's robots.txt file is saved in the "variable" table under the name robotstxt, so my guess is that "robotstxt" should be the name of the multilingual variable... Since all the following ideas appear not to work, this is the only one left. Needs to be tested.
b) Dropping the module into each (sub)domain's modules/ directory separately. This should also separate the subdomains. All (sub)domains need to have it, though, since the root robots.txt file must be removed/renamed.
-> No, this does not separate the symlinked language subdomain from the main language domain (based on lengthy experimentation).
c) There is also an ongoing issue about having one default plus site-specific robots.txt files, which sounds like a reasonable thing to do: http://drupal.org/node/619404. In that thread it is also mentioned that the root might contain a non-standard robots.txt file that is referenced from an individual site's settings.php, see http://drupal.org/node/619404#comment-2210728. If only this could actually work...
-> No, it's the same as b) above, and it also needs the patched module to make the reference. This won't solve it.
d) In any other manual way? No idea how to do it, but see the settings.php sketch below.
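Sketching what a) and d) might look like in practice (not from the module's documentation, just my guess, and it needs testing): since the module reads its content from the robotstxt variable, the shared settings.php could either hard-override that variable depending on the requested host, or declare it a multilingual variable for the i18n module. The hostnames and Disallow paths below are placeholders:

    // Added near the bottom of the settings.php that is shared by the
    // symlinked subdomain.
    // Idea d): hard-override the robotstxt variable per requested host.
    // $conf entries in settings.php take precedence over values stored in
    // the variable table, so the edit window at Site configuration >
    // RobotsTxt would stop having any effect for these hosts.
    if (isset($_SERVER['HTTP_HOST']) && $_SERVER['HTTP_HOST'] == 'fr.domain.com') {
      $conf['robotstxt'] = "User-agent: *\nDisallow: /shadow-path-on-fr\n";
    }
    else {
      $conf['robotstxt'] = "User-agent: *\nDisallow: /shadow-path-on-main\n";
    }

    // Idea a): alternatively, with the i18n module installed, robotstxt
    // could be declared a multilingual variable so each language (and
    // therefore each language subdomain) keeps its own value. Untested.
    // $conf['i18n_variables'] = array('robotstxt');

The obvious drawback of the $conf override is that the robots.txt content then has to be maintained in settings.php instead of in the module's edit window.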
Why do I want to do this, at all?
In order to disallow robots from probing the fairly large unused portion of a multilingual site. Especially if, through some goof-up, search engines have already been led into that part and are finding lots of duplicate/triplicate+ content there. This seems to me a good way to try to put a stop to it.
Robots.txt allow/disallow schema:
two domains:
lang1.domain.com
lang2.domain.com
1 node and its translation:
N1 = node 1 in default lang1 language (original),
RN1 = copy of node 1 in the lang2 subdomain (shadow),
N2-T1 = node 2 - translation of node 1 in lang2 subdomain (original)
RN2-T1 = copy of node 2 - translation of node 1 - on the default lang1 domain (shadow)
Lang1 ----- Lang2 (location on domains)
N1 -------- RN1
RN2-T1 ---- N2-T1
Nodes with R need to be disallowed in robots.txt for each subdomain.
If there are more languages, the whole thing grows faster than a straight line. A shadow copy is created on every other (sub)domain, for the original as well as for each of its translations: with n languages that is n·(n-1) shadow copies per fully translated node, so quadratic rather than exponential growth, but it still adds up quickly.
Note that the same shadow situation occurs for languages by directory path (domain.com/lang2), and on any multilanguage site. In that case the robots.txt paths for allow/disallow are quite straightforward, since everything sits in one root robots.txt (while the number of disallows will be the same).
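To make the schema concrete, here is roughly what the two files could look like, assuming the shadow copies are reachable under their plain node/NN paths (if they have path aliases, those aliases would have to be disallowed instead; the node numbers are the ones from the example above):

    # robots.txt served on lang1.domain.com
    User-agent: *
    # block RN2-T1, the shadow copy of the lang2 translation (node 2)
    Disallow: /node/2

    # robots.txt served on lang2.domain.com
    User-agent: *
    # block RN1, the shadow copy of the lang1 original (node 1)
    Disallow: /node/1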
Some further reaching ideas:
- a module to create all the disallow paths, and possibly insert them into a robots.txt file
- a module for path aliases that generates aliases only for the original and translated versions and leaves node/ paths for all shadow versions, which can then all be disallowed with an (almost) single robots.txt entry (see the sketch below).
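If that path-alias idea worked out, every (sub)domain could share nearly the same robots.txt; a sketch, assuming no wanted content is reachable only under node/ paths:

    User-agent: *
    # all shadow copies remain only under node/NN; the originals and
    # their translations are reached via their path aliases
    Disallow: /node/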
Back to the original issue:
I need to figure out how to make this work.
Good luck, jwr. Help/suggestions are very much appreciated.