Microsoft OneDrive (formerly SkyDrive) is a file hosting service operated by Microsoft.
This notebook covers how to load documents from OneDrive. By default, the document loader loads pdf, doc, docx and txt files. You can load other file types by providing appropriate parsers (see more below).
Prerequisites
- Register an application by following the Microsoft identity platform instructions.
- When registration finishes, the Azure portal displays the app registration’s Overview pane, which shows the Application (client) ID. Also called the client ID, this value uniquely identifies your application in the Microsoft identity platform.
- While registering the application (the first step above), you can set the redirect URI to http://localhost:8000/callback.
- During the same registration steps, generate a new password (client_secret) under the Application Secrets section.
- Add the following SCOPES (offline_access and Files.Read.All) to your application.
- Visit the Graph Explorer Playground to obtain your OneDrive ID. First, make sure you are logged in with the account associated with your OneDrive account. Then make a request to https://graph.microsoft.com/v1.0/me/drive; the response returns a payload with an id field that holds the ID of your OneDrive account.
- Install the o365 package with the command pip install o365.
- At the end of these steps you must have the following values:
  - CLIENT_ID
  - CLIENT_SECRET
  - DRIVE_ID
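The Graph Explorer request described above can also be issued from a script. The following is a minimal sketch, not part of the loader: the helper names are hypothetical, and access_token is assumed to be a valid Microsoft Graph token that you obtained elsewhere (for example, copied from Graph Explorer).

```python
import json
import urllib.request

# Endpoint named in the prerequisites above.
GRAPH_DRIVE_URL = "https://graph.microsoft.com/v1.0/me/drive"


def drive_id_from_payload(payload: dict) -> str:
    """Return the `id` field of a /me/drive response payload."""
    return payload["id"]


def fetch_drive_id(access_token: str) -> str:
    """Request /me/drive with a bearer token and return the drive ID.

    Assumption: `access_token` is a valid Microsoft Graph token
    obtained outside this script.
    """
    request = urllib.request.Request(
        GRAPH_DRIVE_URL,
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return drive_id_from_payload(json.load(response))
```

Calling fetch_drive_id performs a network request; drive_id_from_payload can be used on its own with a payload copied from Graph Explorer.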
🧑 Instructions for ingesting your documents from OneDrive
🔑 Authentication
By default, the OneDriveLoader expects the values of CLIENT_ID and CLIENT_SECRET to be stored as environment variables named O365_CLIENT_ID and O365_CLIENT_SECRET, respectively. You can provide those environment variables through a .env file at the root of your application or set them in your script before instantiating the loader.
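Setting the two variables from a script can look like the sketch below. The values are placeholders, to be replaced with the credentials from your own app registration.

```python
import os

# Placeholder values: replace them with the client ID and client secret
# obtained in the prerequisites.
os.environ["O365_CLIENT_ID"] = "YOUR CLIENT ID"
os.environ["O365_CLIENT_SECRET"] = "YOUR CLIENT SECRET"
```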
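The first time you instantiate the loader, it asks for user consent interactively: it prints a URL that you must visit, and you then paste the resulting page URL back into the console. Once authentication has completed, the loader stores a token (o365_token.txt) in the ~/.credentials/ folder. This token can be used later to authenticate without repeating those copy/paste steps. To use this token for authentication, set the auth_with_token parameter to True when instantiating the loader.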
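As an illustration, the sketch below enables auth_with_token only when a saved token file is present. The helper names are hypothetical, and the import path langchain_community.document_loaders.onedrive is an assumption about the installed LangChain version.

```python
from pathlib import Path

TOKEN_FILENAME = "o365_token.txt"


def has_saved_token(credentials_dir: Path = Path.home() / ".credentials") -> bool:
    """Return True if a token file from a previous login exists."""
    return (credentials_dir / TOKEN_FILENAME).is_file()


def make_loader_with_token(drive_id: str):
    """Build a loader that reuses the saved token when one is available."""
    # Imported lazily so has_saved_token works without LangChain installed.
    # Assumption: this import path matches your langchain-community version.
    from langchain_community.document_loaders.onedrive import OneDriveLoader

    return OneDriveLoader(drive_id=drive_id, auth_with_token=has_saved_token())
```

When no token file exists, the loader falls back to the interactive consent flow.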
🗂️ Documents loader
📑 Loading documents from a OneDrive Directory
OneDriveLoader can load documents from a specific folder within your OneDrive. For instance, you may want to load all documents stored in the Documents/clients folder within your OneDrive.
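A minimal sketch of loading that folder follows. The function name is hypothetical, the import path is an assumption about the installed LangChain version, and the example assumes a token has already been saved by an earlier login.

```python
def load_folder(drive_id: str, folder_path: str = "Documents/clients"):
    """Load every supported document found under `folder_path`."""
    # Assumption: this import path matches your langchain-community version.
    from langchain_community.document_loaders.onedrive import OneDriveLoader

    loader = OneDriveLoader(
        drive_id=drive_id,
        folder_path=folder_path,
        auth_with_token=True,  # assumes a token from a previous login
    )
    return loader.load()
```

For example, load_folder("YOUR DRIVE ID") would return the documents stored in Documents/clients.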
📑 Loading documents from a list of document IDs
Another possibility is to provide a list of object_id values, one for each document you want to load. For that, you need to query the Microsoft Graph API to find the IDs of all the documents you are interested in. The Microsoft Graph API reference lists endpoints that are helpful for retrieving document IDs.
For instance, to retrieve information about all objects stored at the root of your drive, make a request to https://graph.microsoft.com/v1.0/drives/{YOUR DRIVE ID}/root/children. Once you have the list of IDs you are interested in, pass them to the loader when instantiating it.
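The sketch below shows one way to go from a children response to a loader call. The helper names are hypothetical; it assumes the Graph response lists items under a value key, that file items carry a file facet, and that the loader accepts the IDs through an object_ids parameter.

```python
from typing import List


def children_url(drive_id: str) -> str:
    """Build the Graph URL that lists the items at the drive root."""
    return f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/children"


def file_ids(children_payload: dict) -> List[str]:
    """Collect the IDs of items that are files, skipping folders."""
    return [
        item["id"]
        for item in children_payload.get("value", [])
        if "file" in item
    ]


def load_by_ids(drive_id: str, object_ids: List[str]):
    """Load only the documents whose IDs are listed in `object_ids`."""
    # Assumption: this import path matches your langchain-community version.
    from langchain_community.document_loaders.onedrive import OneDriveLoader

    loader = OneDriveLoader(
        drive_id=drive_id,
        object_ids=object_ids,
        auth_with_token=True,  # assumes a token from a previous login
    )
    return loader.load()
```

A typical flow would fetch children_url(drive_id), pass the parsed JSON to file_ids, and hand the result to load_by_ids.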
📑 Choosing supported file types and preferred parsers
By default, OneDriveLoader loads the file types defined in document_loaders/parsers/registry using the default parsers.
To override this behavior, pass a handlers argument to OneDriveLoader.
Pass a dictionary mapping either file extensions (like "doc", "pdf", etc.) or MIME types (like "application/pdf", "text/plain", etc.) to parsers.
Note that you must use either file extensions or MIME types exclusively and cannot mix them.
Do not include the leading dot for file extensions.
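The key rules above can be checked before the dictionary reaches the loader. The following is a minimal sketch with hypothetical helper names: it treats any key containing a slash as a MIME type, which is an assumption made for illustration and not the loader's own validation logic.

```python
from typing import Any, Dict


def check_handler_keys(handlers: Dict[str, Any]) -> str:
    """Return "mime" or "extension", raising ValueError on a rule violation."""
    # Assumption: a key containing "/" is a MIME type, anything else an extension.
    mime_keys = [key for key in handlers if "/" in key]
    if mime_keys and len(mime_keys) != len(handlers):
        raise ValueError("Do not mix file extensions and MIME types.")
    if mime_keys:
        return "mime"
    dotted = [key for key in handlers if key.startswith(".")]
    if dotted:
        raise ValueError(f"Remove the leading dot from: {dotted}")
    return "extension"


def make_loader_with_handlers(drive_id: str, handlers: Dict[str, Any]):
    """Validate the handler keys, then build a loader that uses them."""
    check_handler_keys(handlers)
    # Assumption: this import path matches your langchain-community version.
    from langchain_community.document_loaders.onedrive import OneDriveLoader

    return OneDriveLoader(drive_id=drive_id, handlers=handlers, auth_with_token=True)
```

The dictionary values would be parser instances, for example those provided under langchain_community.document_loaders.parsers.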