cpan-testers/cpantesters-web

Create new view-report.cgi

preaction opened this issue · 7 comments

The existing view-report.cgi is very slow and is often the single cause of server overload. To help reduce this overload, we can migrate this first thing to a new Mojolicious web app. The larger purpose of this web app will be to maintain legacy APIs for backwards compatibility.

To replace the existing view-report.cgi, we need a web app with the following routes:

The application should be started by a script in bin/cpantesters-web-legacy. The application should be deployed by the Rexfile, and started by a runit service file written in etc/runit/web-legacy (remember to add the corresponding log service too).

The current report also expects requests using nttp instead of guid as parameter:

sub retrieve_report {
    $tvars{body}{result} = '""';

    if ( $cgiparams{id} =~ /^\d+$/ ) {

#$dbi is from Labyrinth::Variables, configured by Labyrinth::DBUtils DBConnect
#config/phrasebook.ini:GetStatReport=SELECT * FROM cpanstats.cpanstats WHERE id=?
        my @rows = $dbi->GetQuery( 'hash', 'GetStatReport', $cgiparams{id} );

        if (@rows) {
            if ( $rows[0]->{guid} =~ /^[0-9]+\-[-\w]+$/ ) {
                my $id = guid_to_nntp( $rows[0]->{guid} );
                _parse_nntp_report($id);
            }
            else {
                $cgiparams{id} = $rows[0]->{guid};
                _parse_guid_report();
            }
        }
        else {
            #$tvars{errcode} = 'NEXT';
            #$tvars{command} = 'cpan-distunk';
        }
    }
    elsif ( $cgiparams{id} =~ /^[\w-]+$/ ) {
        my $id = guid_to_nntp( $cgiparams{id} );
        if ($id) {
            _parse_nntp_report($id);
        }
        else {
            _parse_guid_report();
        }
    }

I would encourage dropping the nntp number support, as these were only for reports submitting upto September 2010, for < 9 million reports. It's rare that anyone (excluding bots) would even use that number for retrieval any more. All those reports now have GUIDs anyway, so they aren't lost.

Thanks, @barbie, I was wondering about that: If all the reports got migrated to Metabase, then, yeah, that whole section of code can be ignored.

If we want to be super extra safe, we can view the logs to see if anyone's accessed those IDs in the last week or so. I can't think of any way they've been linked, so likely even bots won't show up in the logs. I have no strong feeling on that. And if someone comes forward saying it was important to them, we can always add it back (they won't, so I'm not worried).

I have some more doubts here.

This issue refers to the rewrite of view-report.cgi, but:

The local metabase cache holds the raw text of all the reports that we have. This will soon be the only metabase (see cpan-testers/cpantesters-api#3), so we should have an API for this in the schema. Using this new API as an abstraction will allow us to change the details of the local metabase cache later.

I'm not seeing that view-report.cgi uses a "local" cache (unless the local cache is the Mysql server at IP ending in 45), this is from a debug session of view-report.cgi:

Labyrinth::Globals::DBConnect(lib/Labyrinth/Globals.pm:412):
412:	    my $logfile     = $settings{logfile};
  DB<4> 
Labyrinth::Globals::DBConnect(lib/Labyrinth/Globals.pm:413):
413:	    my $phrasebook  = $settings{phrasebook};
  DB<4> 
Labyrinth::Globals::DBConnect(lib/Labyrinth/Globals.pm:414):
414:	    my $dictionary  = $settings{dictionary};
  DB<4> 
Labyrinth::Globals::DBConnect(lib/Labyrinth/Globals.pm:416):
416:	    $dbi = Labyrinth::DBUtils->new({
417:	        driver          => $settings{driver},
418:	        database        => $settings{database},
419:	        dbfile          => $settings{dbfile},
420:	        dbhost          => $settings{dbhost},
421:	        dbport          => $settings{dbport},
422:	        dbuser          => $settings{dbuser},
423:	        dbpass          => $settings{dbpass},
424:	        autocommit      => $settings{autocommit},
425:	        logfile         => $logfile,
426:	        phrasebook      => $phrasebook,
427:	        dictionary      => $dictionary,
  DB<4> p $settings{driver}
mysql
  DB<5> p $settings{database}
reports
  DB<6> p $settings{dbhost}
###########.45
  DB<7> q

These are the queries executed by the CGI that I checked so far (in this specific order) for getting reports from GUID:

  1. config/phrasebook.ini:AllDistIndices=SELECT uploadid,dist,version,type FROM cpanstats.uploads
  2. config/phrasebook.ini:AllOSNames=SELECT osname,ostitle FROM cpanstats.osname
  3. config/phrasebook.ini:GetMetabaseByGUID=SELECT * FROM metabase.metabase WHERE guid=?
  4. config/phrasebook.ini:GetTesterFact= SELECT mte.*,tp.* FROM metabase.testers_email mte LEFT JOIN testers.address ta ON ta.email=mte.email LEFT JOIN testers.profile tp ON tp.testerid=ta.testerid WHERE mte.resource=? ORDER BY tp.testerid DESC

This issue refers to a REST API that retrieves data based on DBIx::Class based https://github.com/cpan-testers/cpantesters-schema, but I don't see any the databases/tables that the view-report.cgi uses nowadays. I listed the EER diagrams of the tables that the CGI uses:

  • metabase DB
    metabase_eer
  • testers DB
    testers_eer

So, if I understood correctly, it would be necessary to implement the corresponding DBIx::Classes of metabase and testers and then implement the corresponding REST methods? Or these are already implemented (DBIx::Class and/or REST)?

#9 is ready for revision.

I missed this thread, but from the commits in #9 you've done the right thing and connected to the database directly instead of through DBIx::Class. I tried writing DBIx::Class results for the tables in the metabase database, like you saw, but it was annoying at best (it made things really hard to test). The new plan is to only write DBIx::Class result classes for tables found in the cpanstats database (and, if necessary, move tables into that database, like the new metabase_user table).

The metabase cache is in the MySQL database you found, yes. I had called it the "local" metabase as that database used to live on the CPAN Testers server with the website and everything else, and it's "local" as distinct from the Amazon SimpleDB (which is a different database system entirely).

The REST API I referred to was for the cpantesters-api project, but that all is done now in a different way (using the new test report format). The cpantesters-web project is meant to access the database directly either through DBI (DBIx::Connector) or CPAN::Testers::Schema (DBIx::Class), not through the API server. Right now the idea is that it will make the website faster to not have the layer of abstraction there, and it's easier to write the website in Perl using DBIx::Class than it would be to write it in Perl or JavaScript using the JSON API (also developing that JSON API would mean everything would have to be done twice).