Working with files saved on a local machine

Hi All

I am building an app that lets users run some NLP analysis on text coming from multiple files. My understanding is that a Python script called from Flask cannot see files saved on my local machine, only files saved on the server under my account (static folder, virtual environment, etc.). To bypass this problem I have used an HTML file input and JavaScript FileReader objects, which let me load multiple files, extract their text content and output it in a textarea field, from where I can pass the text over to Python for further analysis.
I managed to get this working for TXT and DOCX files, but it cost me a lot of time trying to figure out a solution in a language I am completely unfamiliar with (JavaScript). I still need to add support for extracting text from PDF files, but before drifting any further from the main focus of this project (NLP) I wanted to ask whether there is an elegant way of passing the content of local files to the scripts in the virtual environment. For example:
The HTML file input allows me to select multiple PDF files. Is there any way I can pass these files over so that, instead of JavaScript, I could use Python libraries to extract the text from them?

<input type="file" id="filesx" name="filesx[]" onchange="readmultifiles(this.files)" multiple="" accept=".txt,.docx,.pdf"/>

Sure! The HTML input with type "file" is actually designed to let you upload files directly to the server without needing JavaScript, and you can have several of them in the same form. The files themselves become visible as file-like objects in your Flask code. The last example in this tutorial would be a good starting point.
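As a rough sketch of what that looks like on the Flask side (this is not the tutorial's code; the route, the form markup and the field name input_file are assumptions picked for illustration): the form is posted as multipart/form-data, and each selected file turns up in request.files as an object with a filename and a file-like stream.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical form. The enctype is what makes the browser send file contents.
UPLOAD_FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="input_file" multiple accept=".txt,.docx,.pdf">
  <input type="submit" value="Upload">
</form>
"""


@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "GET":
        return UPLOAD_FORM
    # getlist() returns every file sent under the same field name.
    sizes = {}
    for file in request.files.getlist("input_file"):
        data = file.stream.read()  # raw bytes of the uploaded file
        sizes[file.filename] = len(data)
    # Report the byte count per file, just to show the content arrived.
    return jsonify(sizes)
```

Nothing is written to disk here; the bytes are read straight from the request, which is usually enough if all you want is to hand them to a text-extraction library.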

Hi Giles, many thanks for your answer.

Following the suggested tutorial, I've managed to read successfully from:

- TXT files, using file.stream.read().decode("utf-8")
- PDF files, passing the file.stream object to the third-party library "pdfplumber"
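For anyone following along, the two cases above could be wrapped in one helper along these lines. This is only a sketch: the function name extract_text and the dispatch on file extension are my own assumptions, not something from the tutorial, and the PDF and DOCX branches need the third-party pdfplumber and python-docx packages installed.

```python
import io
import os


def extract_text(filename, stream):
    """Return the text of an uploaded file, chosen by its extension.

    `stream` is any binary file-like object, e.g. file.stream in Flask.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".txt":
        # Assumes the text file is UTF-8 encoded.
        return stream.read().decode("utf-8")
    if ext == ".pdf":
        import pdfplumber  # third-party, imported only when needed

        with pdfplumber.open(stream) as pdf:
            # extract_text() can return None for a page with no text layer.
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if ext == ".docx":
        import docx  # third-party python-docx package

        document = docx.Document(stream)
        return "\n".join(para.text for para in document.paragraphs)
    raise ValueError(f"Unsupported file type: {ext!r}")
```

In a view it would be called as extract_text(file.filename, file.stream) for each item in request.files.getlist(...).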

Two down, one to go: DOCX files. I was hoping that, as with PDFs, I would be able to read DOCX files using the popular python-docx package (imported as docx), but for the life of me I cannot get it to work with the above objects. I am not getting any errors, just a blank result. The python-docx documentation mentions that it should be able to deal with file-like objects: https://python-docx.readthedocs.io/en/latest/user/documents.html but its example shows a function using a path rather than the objects retrieved from the HTML file input. I appreciate that this is not really related to PythonAnywhere any more, but perhaps you or someone else has already come up with a working solution for reading DOCX file-like objects and could suggest a fix?

Part of my code:

import docx

def getDOCXText(filename):
    doc = docx.Document(filename)
    fullText = ''
    for para in doc.paragraphs:
        fullText = fullText + para.text
    return fullText

for file in request.files.getlist("input_file"):
    txt = getDOCXText(file.stream)

file.stream is a file-like object, so you can pass it wherever the docs use a file-like object. If you're not getting any output, perhaps the library cannot convert the file that you're sending it. Check that the same document converts properly when you open it from a file on disk.
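If you want a quick cross-check that doesn't depend on python-docx, something like the sketch below might help. It relies on a DOCX being a ZIP archive and assumes the main body lives at word/document.xml, which is where Word normally puts it; the function name is made up for this example. It collects every text run it can find, including text inside tables, which doc.paragraphs does not cover.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_fragments(data):
    """Return the non-empty <w:t> text fragments found in DOCX bytes.

    Raises zipfile.BadZipFile if `data` is not a ZIP archive, and KeyError
    if the archive has no word/document.xml part.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # iter() walks the whole tree in document order, so runs inside
    # tables are included as well as body paragraphs.
    return [node.text for node in root.iter(f"{{{W_NS}}}t") if node.text]
```

Read the upload once with data = file.stream.read() and pass data in. If this finds text while doc.paragraphs comes back empty, the text is probably sitting in tables or similar containers; if it raises BadZipFile, the upload isn't really a DOCX file.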

I've just tried another DOCX and this time the script was able to convert it - so it was an issue with the source file. Many thanks for helping me out!

Good to see that you figured it out!