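PDF File
- Version: 2022.1 and later
This article describes how to connect Tableau to .pdf file data and set up the data source.
Note: Tableau doesn't support right-to-left (RTL) languages. If your PDF includes RTL text, the characters might display in reverse order in Tableau.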
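Make the connection and scan your document for tables
Start Tableau and under Connect, click PDF File.
Select the file you want to connect to, and then click Open.
In the Scan PDF File dialog box, specify the pages in the file that you want Tableau to scan for tables. You can scan for tables on all pages, on a single page, or on a range of pages.
Note: The scan counts the first page of the file as page 1, similar to most PDF readers. When you scan for tables, specify the page number that the PDF reader displays, not the page number printed in the document itself, which might or might not begin at page 1.
For example, suppose you want to use "Table 1" in the image below. The PDF reader shows one number and the .pdf file shows a different number. To scan this table properly, specify the page number that the PDF reader displays. In this example, you would specify page 15.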
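On the data source page, do the following:
(Optional) Select the default data source name at the top of the page, and then enter a unique data source name for use in Tableau. For example, use a data source naming convention that helps other users of the data source figure out which data source to connect to. The default name is generated automatically from the file name.
If your file contains one table, click the sheet tab to start your analysis. Otherwise, drag a table from the left pane to the canvas, and then click the sheet tab to start your analysis.
About tables in the left pane
Tables identified in your .pdf file are given unique names and displayed in the left pane after the scan. For example, you might see a table name like "Page 1, Table 1". The first part of the table name indicates the page in the .pdf file that the table comes from. The second part of the table name indicates the order in which the table was identified. If Tableau identifies more than one table on a page, the second part of the table name can indicate one of two things:
Tableau has identified another unique table or sub-table on the page.
Tableau has interpreted the table on the page in an alternative way. Tableau might provide multiple interpretations of a table depending on how the table is presented in the .pdf file.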
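The naming scheme can be made concrete with a small sketch. It assumes the English label format "Page N, Table M" from the example above; the helper names `parse_table_name` and `group_by_page` are hypothetical and are not part of Tableau.

```python
import re
from typing import NamedTuple


class TableName(NamedTuple):
    page: int   # page as counted by the scan (the first page of the file is 1)
    table: int  # order in which the table was identified on that page


# Assumed label format, taken from the "Page 1, Table 1" example in the text.
_NAME_PATTERN = re.compile(r"^Page (\d+), Table (\d+)$")


def parse_table_name(name: str) -> TableName:
    """Split a left-pane table name into its page part and its order part."""
    match = _NAME_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized table name: {name!r}")
    return TableName(page=int(match.group(1)), table=int(match.group(2)))


def group_by_page(names):
    """Map each page number to the table indexes identified on that page."""
    pages = {}
    for name in names:
        parsed = parse_table_name(name)
        pages.setdefault(parsed.page, []).append(parsed.table)
    return pages
```

A page that maps to more than one index is a page where the second part of the name needs a closer look: the extra entries might be additional tables or alternative interpretations of the same table.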
PDF file data source example
Here is an example of a PDF file data source:
Get more data
Get more data into your data source by adding more tables or connecting to data in a different database.
Add more data from the current file:
From the left pane, drag additional tables to the canvas to combine data using a join or union. For more information, see Join Your Data or Union Your Data.
If the pages that were scanned in step 3 of the procedure listed above do not produce the tables that you need in the left pane, click the drop-down arrow next to the PDF File connection, and click Rescan PDF file. This option allows you to create a new scan so that you can specify different pages in the .pdf file to scan for tables.
Add more data from a different database: In the left pane, click Add next to Connections. For more information, see Join Your Data.
If a connector you want is not listed in the left pane, select Data > New Data Source to add a new data source. For more information, see Blend Your Data.
Set table options
You can set table options. On the canvas, click the table drop-down arrow and then specify whether the data includes field names in the first row. If so, these names will become the field names in Tableau. If field names are not included, Tableau generates them automatically. You can rename the fields later.
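The effect of this option can be sketched in a few lines of Python. This is an illustration only, not Tableau's implementation; the function name `assign_field_names` is hypothetical, and the generated names F1, F2, and so on are assumed from the F1 through F4 fields mentioned elsewhere in this article.

```python
def assign_field_names(rows, first_row_is_header):
    """Return (field_names, data_rows) for a table given as a list of rows.

    When the first row holds field names, it is consumed as the header.
    Otherwise names are generated as F1, F2, ... (an assumed pattern).
    """
    if not rows:
        return [], []
    if first_row_is_header:
        return list(rows[0]), [list(row) for row in rows[1:]]
    width = max(len(row) for row in rows)
    names = [f"F{i}" for i in range(1, width + 1)]
    return names, [list(row) for row in rows]
```

With the option turned off, the original header row stays in the data as an ordinary row, which is why a wrong setting shows up as header text among the table values.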
Use Data Interpreter to clean your data
If Tableau detects that it can help optimize your data source for analysis, it prompts you to use Data Interpreter. Data Interpreter can detect sub-tables that you can use and remove unique formatting that might cause problems later on in your analysis. For more information, see Clean Data from Excel, CSV, PDF, and Google Sheets with Data Interpreter.
Union tables in your .pdf files
You can union tables in your file. For more information about union, see Union Your Data.
When you use wildcard search to union tables, the result is scoped to the pages that were scanned in the initial file you connected to. For example, suppose you have three files: A.pdf, B.pdf, and C.pdf. The first file you connect to is A and you limit the scan for tables to page 1. When you use wildcard search to union tables from files B and C, the additional tables included in the union can only come from page 1 of B and page 1 of C.
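The scoping rule can be modeled as a simple filter. The sketch below is a toy model of the behavior described above, not a Tableau API; `union_scope` and the dictionary layout are assumptions made for illustration.

```python
def union_scope(initial_scan_pages, tables_by_file):
    """List the tables a wildcard union could include.

    initial_scan_pages: pages scanned in the first file you connected to.
    tables_by_file: {file name: {page number: [table names]}} (hypothetical layout).
    Only tables on the initially scanned pages are eligible, in every file.
    """
    allowed = set(initial_scan_pages)
    included = []
    for filename in sorted(tables_by_file):
        for page in sorted(tables_by_file[filename]):
            if page in allowed:
                for table in tables_by_file[filename][page]:
                    included.append((filename, page, table))
    return included
```

If a table you need sits on page 3 of B.pdf, the model shows why it never appears: page 3 was not part of the initial scan of A.pdf, so rescanning with a wider page range is the way to bring it in.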
Tips for working with .pdf files
The following tips can help you work with your .pdf files in Tableau.
Use the PDF File connector to identify just the tables in your .pdf file.
The primary goal of the PDF File connector is to find and identify tables in your .pdf file. Therefore, it ignores any other information in the file that does not appear to be part of a table, including titles, captions, and footnotes. If related data is stored in one of these areas, such as in the table title, you can use Tableau to first export the .pdf file data to a .csv file, manually add the data that was stored in the table title, and then connect to the .csv file instead. For more information, see Export your data to .csv file.
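If the same title value has to be added to many exported rows, the manual step can be scripted. The following is a minimal sketch that assumes the export is a plain comma-separated file with a header row; the function name `add_title_column` and the sample column name are hypothetical.

```python
import csv
import io


def add_title_column(csv_text, column_name, value):
    """Append one column holding the same value (for example, text taken
    from the table title) to every data row of an exported .csv file."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for index, row in enumerate(reader):
        # The first row is assumed to be the header row.
        writer.writerow(row + [column_name if index == 0 else value])
    return output.getvalue()
```

For a file on disk, read its contents with `open(path, newline="")`, pass the text to the function, and write the result to a new .csv file to connect to.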
Use standard tables.
In general, Tableau works best with standard tables that use a tabular format.
Ideally, the tables in your .pdf file have column headers on a single line and row values on a single line, as demonstrated in the example below.
Colors and shading used in or around the tables can affect how the tables are identified.
Tables that have unique formatting might require some cleanup or manual editing outside of Tableau. Unique formatting can include hierarchical headers, header names that span multiple lines, row values that span multiple lines, angle headers, and stacked tables as demonstrated in the examples shown below.
Note: Tableau does not support connections to .pdf files generated by scanning (optical character recognition) software.
Validate the data.
Make sure that you validate the data in the tables that Tableau identifies in your .pdf file. You can validate the data by using either the data grid or, if you used Data Interpreter, the results workbook.
Avoid tables that span across pages.
If your .pdf file contains a table that spans across pages, Tableau interprets that table as multiple tables. To resolve this issue, use a union to combine the tables. For more information, see Union Your Data.
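Conceptually, the union appends the rows of the second piece to the rows of the first. The sketch below illustrates that idea outside Tableau, assuming each piece is available as a pair of field names and rows and that both pieces use the same field names; `union_tables` is a hypothetical helper.

```python
def union_tables(tables):
    """Stack tables that share the same field names into one table.

    tables: list of (field_names, rows) pairs, one per page fragment.
    """
    if not tables:
        raise ValueError("at least one table is required")
    header = list(tables[0][0])
    combined = []
    for fields, rows in tables:
        if list(fields) != header:
            raise ValueError(f"field names differ: {list(fields)} vs {header}")
        combined.extend(list(row) for row in rows)
    return header, combined
```

The field name check matters in practice: if the continuation page has no header row, set its table options first so that both pieces end up with matching field names.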
Rename .pdf files whose file names contain Unicode characters.
After connecting to a .pdf file that contains Unicode characters in its file name, you might see the following error.
To resolve this issue, rename the file using non-Unicode characters, and connect to your .pdf file again.
Do not use password protected .pdf files.
After connecting to and scanning a .pdf file for tables, you might see the following error.
Tableau shows this error when your .pdf file is password protected and Tableau cannot access its contents. Tableau does not support connections to password-protected .pdf files.
Alias values that are interpreted differently or incorrectly.
In the data grid you might notice that some values are interpreted differently from the .pdf file. You can correct this interpretation by using aliases to rename specific values within a field.
For example, suppose you see the following table after connecting to your .pdf file. Some state abbreviations, highlighted in blue, are interpreted in lowercase.
You can resolve this issue by using aliases to change the lowercase abbreviations to uppercase abbreviations. To do this, click the drop-down arrow next to the column name and select Aliases.
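An alias works like a lookup table from the value as it was interpreted to the value you want to see. The sketch below mirrors that idea for the lowercase abbreviation case; it is an illustration only, and the helper names are hypothetical.

```python
def uppercase_aliases(values):
    """Build an alias map for every distinct value that is not already uppercase."""
    return {value: value.upper() for value in set(values) if value != value.upper()}


def apply_aliases(values, aliases):
    """Replace each value that has an alias; leave all other values unchanged."""
    return [aliases.get(value, value) for value in values]
```

For a column such as `["WA", "ca", "OR", "ca"]`, the alias map contains a single entry that turns `ca` into `CA`, so you define one alias per distinct misread value rather than one per row.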
Resolve column headers that are interpreted as table values.
In the data grid you might also notice that some column headers in your .pdf file are interpreted as table values instead. This can occur if your .pdf file contains tables with unique formatting or hierarchical headers. In this scenario, try the Data Interpreter first. If Data Interpreter doesn't resolve the issue, consider manually renaming the columns to their appropriate names and filtering header names that are being treated as values by using data source filters.
For example, suppose you see the following table after connecting to your .pdf file. The table headers from the .pdf file, highlighted in blue, are interpreted as table values.
One way you can resolve a header issue like this is to follow steps similar to the following:
Double-click the column name, and then rename F1 to Year. Repeat this step for F2 through F4 for Coal, Gas, and Oil.
Click the data type icon for the Year column and change it to a number data type. This causes the non-numerical values in this column to convert to null values.
In the upper-right corner of the data source page, under Filters, click Add. In the Edit Data Source Filters dialog box, click the Add button, and then select the Year field.
In the Filter dialog box, select both the Null and Exclude check boxes.
The rows whose Year value is null are removed from the data grid. Because the filter removes entire rows, the corresponding values in the other columns are removed as well.
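The same three ideas (rename the fields, convert Year to a number so that stray header text becomes null, then exclude the null rows) can be sketched outside Tableau as follows. The field names come from the example above; the function names and any sample data you pass in are illustrative assumptions.

```python
def to_number(value):
    """Convert a value to an integer; anything non-numeric becomes None (null)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def clean_header_rows(rows):
    """rows: list of dicts keyed by the generated names F1 through F4."""
    names = {"F1": "Year", "F2": "Coal", "F3": "Gas", "F4": "Oil"}
    # Step 1: rename the generated field names.
    renamed = [{names.get(key, key): value for key, value in row.items()} for row in rows]
    # Step 2: change Year to a number; header text such as "Year" becomes None.
    for row in renamed:
        row["Year"] = to_number(row["Year"])
    # Step 3: exclude the rows whose Year is null.
    return [row for row in renamed if row["Year"] is not None]
```

The conversion in step 2 is what makes the filter in step 3 possible: the repeated header rows are the only rows whose Year value cannot be read as a number.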
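About .ttde and .hhyper files
You might notice .ttde or .hhyper files when navigating your computer's directory. When you create a Tableau data source that connects to your data, Tableau creates a .ttde or .hhyper file. This file, also known as a shadow extract, helps your data source load faster in Tableau Desktop. Although a shadow extract contains underlying data and other information similar to a standard Tableau extract, it is saved in a different format and can't be used to recover your data.
In certain situations, you might need to delete a shadow extract from your computer. For more information, see Low Disk Space Because of Shadow Extracts in the Tableau Knowledge Base.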
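Before deleting anything, it can help to see how much space these files take. The following is a minimal sketch that lists .ttde and .hhyper files under a folder you choose; the function name is hypothetical, and the folder to search is left to you because this article does not state where the files are stored.

```python
from pathlib import Path

# File extensions named in this article for shadow extracts.
SHADOW_SUFFIXES = {".ttde", ".hhyper"}


def find_shadow_extracts(root):
    """Return (path, size in bytes) for each shadow extract under root,
    largest first. The function only reads the folder; it deletes nothing."""
    found = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in SHADOW_SUFFIXES:
            found.append((path, path.stat().st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Follow the Knowledge Base article for the actual cleanup steps; the listing is only meant to show which files are worth looking at first.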