ELECTORAL DATA
Gather PDFs
Spiders/Downloaders
Deep Crawl with wget
809k PDFs Downloaded
709k Text PDFs
Begin Extraction
Convert PDFs to HTML
Extract Attributes from HTML
Output to structured CSVs
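The extraction branch above (HTML in, attributes out, structured CSV at the end) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library. The field labels (`Name`, `EPIC No`, `Age`), the `FIELD_PATTERNS` table, and the sample record are hypothetical and not taken from the real PDFs. The PDF-to-HTML conversion step is assumed to have already run.

```python
import csv
import io
import re
from html.parser import HTMLParser
from typing import Dict, Iterable, List, TextIO


class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(html: str) -> List[str]:
    """Return the visible text of an HTML string as a list of chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical labels; the real documents may use different wording.
FIELD_PATTERNS = {
    "name": re.compile(r"^Name\s*:\s*(.+)$"),
    "epic_number": re.compile(r"^EPIC No\s*:\s*(\S+)$"),
    "age": re.compile(r"^Age\s*:\s*(\d+)$"),
}


def extract_attributes(html: str) -> Dict[str, str]:
    """Match each text chunk against the patterns; keep the first hit per field."""
    record = {field: "" for field in FIELD_PATTERNS}
    for line in html_to_lines(html):
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match and not record[field]:
                record[field] = match.group(1).strip()
    return record


def write_csv(records: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write the records as CSV with one column per field."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = "<div><p>Name : Asha Devi</p><p>EPIC No : ABC1234567</p><p>Age : 42</p></div>"
    buffer = io.StringIO()
    write_csv([extract_attributes(sample)], buffer)
    print(buffer.getvalue())
```

Keeping the patterns in one table means a new attribute only needs one new entry, and the CSV header follows automatically.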
Form fill with Scrapy + download PDFs
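For the wget deep-crawl option above, one possible way to drive the crawl from Python is to build the command as an argument list and hand it to `subprocess`. This is only a sketch: the start URL, output directory, depth, and wait time are placeholder assumptions, not the values used to fetch the 809k PDFs. The form-fill path with Scrapy is not covered here.

```python
import shlex
import subprocess
from typing import List


def build_wget_command(
    start_url: str,
    out_dir: str,
    depth: int = 5,
    wait_seconds: int = 1,
) -> List[str]:
    """Build a recursive wget invocation that keeps only PDF files."""
    return [
        "wget",
        "--recursive",                    # follow links from the start page
        f"--level={depth}",               # maximum recursion depth
        "--no-parent",                    # never climb above the start directory
        "--accept=pdf",                   # keep only files ending in .pdf
        "--no-clobber",                   # skip files that already exist locally
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # where downloads are stored
        start_url,
    ]


if __name__ == "__main__":
    # Placeholder URL and directory, for illustration only.
    command = build_wget_command("https://example.org/rolls/", "pdfs", depth=3)
    print(shlex.join(command))
    # Uncomment to actually start the crawl (requires wget on PATH):
    # subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems when URLs contain `&` or `?`.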
Download PDFs with Scrapy + form fills or Deep crawling with wget
100k Image PDFs
+
First pages of all 809k PDFs (non-English text), to extract clean Unicode addresses and other analytically important fields from the first page.
Convert PDFs to images
Use OpenCV to crop the relevant sections, such as the EPIC number and names, from the images.
Send the cropped images to the Tesseract API to OCR them into text
Tesseract API
Retrieve the text from the image sent.
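The crop-then-OCR steps of the image branch might look something like the sketch below. It assumes the page has already been rendered to an image file, and it uses the `pytesseract` wrapper as a stand-in for whatever "Tesseract API" the original pipeline called. The fractional crop box, the `fractional_box` and `ocr_region` helpers, and the Otsu thresholding step are illustrative assumptions, not the author's actual settings.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]            # x0, y0, x1, y1 in pixels
FracBox = Tuple[float, float, float, float]  # same corners as fractions of page size


def fractional_box(width: int, height: int, frac: FracBox) -> Box:
    """Convert a box given as page fractions into pixel coordinates."""
    fx0, fy0, fx1, fy1 = frac
    return (
        int(round(fx0 * width)),
        int(round(fy0 * height)),
        int(round(fx1 * width)),
        int(round(fy1 * height)),
    )


def ocr_region(image_path: str, frac: FracBox, lang: str = "eng") -> str:
    """Crop one region of a page image with OpenCV and OCR it with Tesseract."""
    # Imported here so the coordinate helper works without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    height, width = image.shape[:2]
    x0, y0, x1, y1 = fractional_box(width, height, frac)
    crop = image[y0:y1, x0:x1]  # NumPy slicing: rows (y) first, then columns (x)

    # Grayscale plus Otsu binarization often helps OCR on scanned pages.
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(binary, lang=lang).strip()


if __name__ == "__main__":
    # Hypothetical region: a band across the upper part of a 1000 x 2000 page.
    print(fractional_box(1000, 2000, (0.1, 0.2, 0.5, 0.4)))
```

Expressing the crop box as fractions of the page keeps the same region definition usable across images rendered at different resolutions. For non-English pages, `lang` would need to name a Tesseract language pack that is actually installed.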