Debugging sourcers
Debugging a Parser
Say we have an error like this:

The first step is to download the page object and inspect its HTML.
Sourcing data is stored in the following S3 buckets:
- PROD: sourcing-data-prod
- TEST: goodfit-sourcer-data-test
The best way to download it is via the AWS CLI. Substitute the payload.filename from the error above into the S3 path:
AWS_PROFILE=gf-prod aws s3 cp s3://sourcing-data-prod/crunchbase-enricher/www.crunchbase.com_organization_the-event-agency-e54e.json.gzip - | gunzip
(Change AWS_PROFILE to match the environment.)
The command above copies the S3 file to stdout (-) and then pipes it through gunzip to decompress it.
To get just the HTML content of the first item in the file, add a jq filter:
AWS_PROFILE=gf-prod aws s3 cp s3://sourcing-data-prod/crunchbase-enricher/www.crunchbase.com_organization_the-event-agency-e54e.json.gzip - | gunzip | jq -r '.[0].htmlContent'
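If you have already downloaded the file and would rather inspect it from Node, the same extraction can be sketched in TypeScript. This is a minimal sketch, assuming the file is a gzipped JSON array whose entries carry an htmlContent field (as the jq filter above implies). The PageEntry type and firstHtmlContent helper are hypothetical names, not part of the sourcer codebase.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Assumed shape of one entry in the page object file. Only htmlContent is
// implied by the jq filter; any other fields are left open.
interface PageEntry {
  htmlContent?: string;
  [key: string]: unknown;
}

// Hypothetical helper: decompress a gzipped page object and return the HTML
// of its first entry, or undefined if the array is empty or the field is missing.
function firstHtmlContent(gzipped: Buffer): string | undefined {
  const entries = JSON.parse(gunzipSync(gzipped).toString("utf8")) as PageEntry[];
  return entries[0]?.htmlContent;
}

// In-memory sample standing in for a downloaded file.
const sample = gzipSync(
  Buffer.from(JSON.stringify([{ htmlContent: "<html>blocked</html>" }])),
);
console.log(firstHtmlContent(sample));
```

For a real file, read it with readFileSync from node:fs and pass the resulting Buffer to the helper.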
In this case we can clearly see that the issue is that the request was geo-blocked:

Parser Tips
Parsing Values
Each parsed property should return undefined when no value can be found, so that historic values in the parsers table are preserved.
This is because records are UPSERTed when we update them.
A null value will overwrite any historic value, whereas undefined properties are skipped in the update.
For this reason, always fall back to undefined for unknown values.
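The fallback rule can be illustrated with a small sketch. The ParsedCompany shape and the parseText and parseCount helpers are hypothetical and only show the pattern of returning undefined, never null, when a value is missing or unusable.

```typescript
// Hypothetical parsed shape; the property names are illustrative only.
interface ParsedCompany {
  name?: string;
  employeeCount?: number;
}

// Return the trimmed text, or undefined when nothing usable was found.
function parseText(raw: string | null | undefined): string | undefined {
  const text = raw?.trim();
  return text ? text : undefined;
}

// Return a number, or undefined when the text is missing or not numeric.
function parseCount(raw: string | null | undefined): number | undefined {
  const text = parseText(raw);
  if (text === undefined) return undefined;
  const digits = text.replace(/,/g, "");
  if (digits === "") return undefined;
  const value = Number(digits);
  return Number.isFinite(value) ? value : undefined;
}

// A missing employee count stays undefined rather than becoming null,
// so it would be skipped in the update instead of overwriting history.
const parsed: ParsedCompany = {
  name: parseText("  The Event Agency "),
  employeeCount: parseCount(null),
};
console.log(parsed);
```

Keeping the null-to-undefined conversion inside small helpers like these means individual property parsers cannot accidentally return null.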
Resetting a sourcer
Once you have made your changes and fixed your parser, you will need to reset the sourcer so that any records that previously errored can be re-parsed:
-- Replace `crunchbase` with the sourcer name
UPDATE public.crunchbase
SET sourcing_status = NULL
WHERE sourcing_status = 'Error';