ELECTORAL DATA (Gather PDFs (Spiders/Downloaders (Deep Crawl with wget…

- - - - 809k PDFs Downloaded
        
        709k Text PDFs
        
        Extraction Begin
        
        Convert PDFs to HTML
        
        Extract Attributes from the HTML
        
        Output to structured CSVs
        
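The last two steps of the text-PDF branch above (extract attributes from the HTML, output to structured CSVs) can be sketched as follows. This is a minimal illustration, not the project's actual extractor: it assumes the PDFs have already been converted to HTML by some tool, and the field labels in `FIELD_PATTERNS` and the sample record are hypothetical placeholders. Only the Python standard library is used.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the visible text of an HTML page, one entry per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical field labels: real rolls will need their own patterns.
FIELD_PATTERNS = {
    "name": re.compile(r"Name\s*:\s*(.+)"),
    "house_no": re.compile(r"House No\s*:\s*(.+)"),
    "age": re.compile(r"Age\s*:\s*(\d+)"),
}


def extract_attributes(html):
    """Return a dict with the first match of each known field in the page."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    record = {}
    for chunk in parser.chunks:
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.search(chunk)
            if match and field not in record:
                record[field] = match.group(1).strip()
    return record


def write_csv(records, out):
    """Write the records to a file-like object with a fixed column order."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(records)


# Made-up sample page, only to show the flow end to end.
sample = (
    "<html><body><p>Name : Asha Devi</p>"
    "<p>House No : 12</p><p>Age : 34</p></body></html>"
)
buffer = io.StringIO()
write_csv([extract_attributes(sample)], buffer)
print(buffer.getvalue())
```

Keeping the patterns in one dictionary means the CSV columns and the extraction rules stay in sync; fields missing from a page are simply left empty in the output row.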
- - - - 100k Image PDFs
        +
        The first page of each of the 809k PDFs (non-English text), to extract clean Unicode addresses and other analytically important fields from the first page.
        
        Convert PDFs to images
        
        Use OpenCV to crop and extract relevant sections, such as EPIC numbers and names, from the images.
        
        Send the cropped images to the Tesseract API for OCR (image to text)
        
        Tesseract API
        
        Retrieve the text from the image sent.
        
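The crop-and-OCR steps of the image branch above can be sketched roughly as below. Several things here are assumptions rather than details from the map: the page image is taken to exist already on disk, entries are assumed to sit in a regular grid (the hypothetical `grid_boxes` helper), and the `pytesseract` wrapper stands in for whatever "Tesseract API" the pipeline actually calls. The `lang` value must match a language pack installed for Tesseract.

```python
def grid_boxes(width, height, rows, cols):
    """Split a page of the given pixel size into rows x cols crop boxes.

    Assumes a regular grid layout. Returns (x, y, w, h) tuples in
    reading order (left to right, top to bottom).
    """
    cell_w, cell_h = width // cols, height // rows
    return [
        (c * cell_w, r * cell_h, cell_w, cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def ocr_region(image_path, box, lang="eng"):
    """Crop one box from a page image and return the text Tesseract reads."""
    # Imported lazily so the layout helper above runs without OpenCV installed.
    import cv2
    import pytesseract

    page = cv2.imread(image_path)
    if page is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(image_path)
    x, y, w, h = box
    crop = page[y:y + h, x:x + w]
    # Grayscale plus Otsu thresholding is a common clean-up step before OCR.
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang).strip()


# Layout helper only; no image or OCR engine is needed for this part.
boxes = grid_boxes(300, 200, rows=2, cols=3)
print(len(boxes), boxes[0], boxes[-1])
```

A call such as `ocr_region("page_001.png", boxes[0], lang="hin")` would then return the text of the first cell; the file name and language code are placeholders.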