How to Process a Large CSV File in .NET Without Running Out of Memory

Q: Can .NET process a 20 GB CSV file with limited memory?

Yes. A .NET application can process a 20 GB CSV file with limited memory by reading it as a stream and retaining only the current record, a controlled batch and bounded pipeline buffers.

Q: Does async file reading reduce memory usage?

Not automatically. Asynchronous I/O reduces thread blocking, but methods such as ReadAllLinesAsync still materialize the complete result. Streaming and asynchronous I/O address different concerns.

Q: What buffer size should I use for a large file?

A moderate buffer such as 64 KB or 128 KB is a reasonable starting point. The final value should be selected through benchmarking with representative files, storage and concurrency.

Q: What is the best database batch size?

There is no universal best batch size. Values between 1,000 and 10,000 records are useful starting points, but the final size depends on record width, database behavior, memory, latency and retry requirements.

Q: Can I parse a CSV file with string.Split?

Only for a strictly controlled format in which fields cannot contain delimiters, quotation marks or line breaks. General CSV files require a parser that understands CSV quoting and escaping rules.

Q: Does CsvHelper load the complete file into memory?

CsvHelper can enumerate records progressively when the sequence is consumed directly. Calling ToList, ToArray or another materializing operation stores all resulting records in memory.

Q: Should I read one large CSV file using multiple threads?

Usually not. Partition boundaries can occur inside quoted or multiline fields. A safer design uses one sequential parser and bounded parallelism for independent downstream transformations.

Q: How should a large CSV upload be handled in ASP.NET Core?

Store the file in durable storage, create an import job and process it with a background worker. The client should query job status or receive a notification instead of keeping one HTTP request open.

Q: How can large JSON and XML files be processed with low memory?

Use an asynchronous enumerable or forward-only UTF-8 reader for JSON and XmlReader for XML. Avoid loading a complete document model when progressive processing is sufficient.

Q: What happens if one record is extremely large?

One exceptionally large record can still create a memory spike because streaming limits memory relative to the data currently being processed. Apply maximum record and field sizes when the format permits.

To process a large CSV file in .NET safely, the application should stream records progressively instead of loading the complete file into memory.

A 10 GB, 15 GB, or 20 GB file can create serious memory pressure when an application uses methods such as ReadAllLinesAsync, ReadToEnd, or converts every parsed record into a collection.

The central principle is simple: file size should not determine application memory usage. Memory should instead be controlled by the buffer size, batch size, record size, channel capacity, and number of workers.

Quick recommendation

Use a sequential FileStream, parse records with CsvHelper, validate one record at a time, and write controlled batches with SqlBulkCopy.

How to Process a Large CSV File in .NET Without Loading It into Memory

The recommended architecture reads only a small portion of the file at a time. Each record is parsed, validated, transformed, and persisted before the application continues through the source file.

Why Large CSV Files Cause Memory Problems

The most common problem is not the size of the file itself. The problem is how the application reads and represents it.

Consider the following implementation:

string[] lines = await File.ReadAllLinesAsync( filePath, cancellationToken);

Although the method is asynchronous, it still returns an array containing every line in the file. Asynchronous I/O prevents a thread from being blocked while waiting, but it does not automatically make an operation memory-efficient.

The situation can become worse after the data is:

decoded from bytes into .NET strings;
split into fields;
converted into objects;
stored inside arrays, lists, or a DataTable;
duplicated during transformation;
retained until a database operation completes.

A file that occupies 20 GB on disk can therefore require substantially more memory if the entire dataset is materialized.

Large-File Processing Approaches Compared

Approach	Memory behavior	CSV correctness	Best use
`ReadAllLinesAsync`	Loads all lines into memory	Does not parse CSV fields	Small, controlled files
`StreamReader.ReadLineAsync`	Keeps approximately one line in application code	Insufficient for multiline quoted fields	Logs, NDJSON, and simple delimited text
Streaming CsvHelper records	Processes records incrementally	Handles CSV quoting and escaping rules	Real production CSV files
Streaming plus bounded batches	Memory is limited by batch size	Depends on the parser	Database imports and ETL
Bounded Channel pipeline	Memory is limited by channel capacity and worker count	Depends on the parser	CPU-heavy validation and transformation

The Recommended .NET Architecture

A production-ready file ingestion pipeline can be organized into the following stages:

Large CSV file or cloud-storage stream ↓ Sequential FileStream or network stream ↓ CSV parser ↓ Record validation and transformation ↓ Optional bounded Channel ↓ Controlled batch ↓ Staging table or destination system ↓ Validation, merge, checkpoint and completion status

Each stage should have a clear memory boundary. The reader should not be allowed to produce unlimited records when the database, API, or downstream service cannot consume them at the same speed.

Anti-Patterns to Avoid

The following operations are convenient for small files but risky for very large imports:

// Loads the complete file into one string. string content = await File.ReadAllTextAsync(filePath); // Loads every line into an array. string[] lines = await File.ReadAllLinesAsync(filePath); // Materializes all parsed records. List<CustomerImportRow> records = csv.GetRecords<CustomerImportRow>().ToList(); // Builds a DataTable containing the complete file. DataTable completeImport = BuildDataTableForEntireFile(filePath); // Creates unrestricted parallel work. Task[] tasks = records .Select(record => Task.Run(() => Process(record))) .ToArray();

Other common mistakes include using an unbounded queue, inserting one database row at a time, keeping every rejected record in memory, or assuming that increasing parallelism will always improve disk or database throughput.

Step 1: Read the File Sequentially

For line-oriented text, FileStream and StreamReader provide a straightforward starting point.

public static async Task ProcessTextFileAsync( string filePath, CancellationToken cancellationToken = default) { var options = new FileStreamOptions { Mode = FileMode.Open, Access = FileAccess.Read, Share = FileShare.Read, Options = FileOptions.Asynchronous | FileOptions.SequentialScan, BufferSize = 64 * 1024 }; await using var stream = new FileStream(filePath, options); using var reader = new StreamReader(stream); long lineNumber = 0; while (true) { string? line = await reader.ReadLineAsync(cancellationToken); if (line is null) { break; } lineNumber++; await ProcessLineAsync( line, lineNumber, cancellationToken); } } private static Task ProcessLineAsync( string line, long lineNumber, CancellationToken cancellationToken) { // Validate, transform or write the current line. return Task.CompletedTask; }

FileOptions.SequentialScan communicates that the file will primarily be read from beginning to end. It is appropriate for large imports that do not require random access.

This implementation keeps memory usage bounded, but it should not be treated as a complete CSV parser.

Why string.Split Is Not a Reliable CSV Parser

A CSV file is not always equivalent to one record per physical line with fields separated by commas.

For example:

1024,"Valluri, Nirmala","Cary, NC","Customer said ""please call""."

A CSV file can contain:

commas inside quoted fields;
escaped quotation marks;
line breaks inside quoted fields;
different delimiters;
empty or missing columns;
different date and numeric formats;
headers that do not exactly match property names.

The following code can therefore produce incorrect results:

string[] columns = line.Split(',');

For real CSV files, use a parser that understands CSV quoting and escaping rules.

Step 2: Stream Records with CsvHelper

Install the required packages:

dotnet add package CsvHelper dotnet add package Microsoft.Data.SqlClient

Create a model representing one record:

public sealed class CustomerImportRow { public long CustomerId { get; set; } public string Name { get; set; } = string.Empty; public string Email { get; set; } = string.Empty; public decimal Balance { get; set; } }

The following importer reads records progressively, validates them, and retains only a bounded batch.

using System.Data; using System.Globalization; using CsvHelper; using CsvHelper.Configuration; using Microsoft.Data.SqlClient; public sealed record ImportResult( long RowsRead, long RowsImported, long RowsRejected); public sealed class LargeCsvImporter { private const int BatchSize = 5_000; private readonly string _connectionString; public LargeCsvImporter(string connectionString) { _connectionString = connectionString; } public async Task<ImportResult> ImportAsync( string filePath, CancellationToken cancellationToken = default) { long rowsRead = 0; long rowsImported = 0; long rowsRejected = 0; var batch = new List<CustomerImportRow>(BatchSize); var fileOptions = new FileStreamOptions { Mode = FileMode.Open, Access = FileAccess.Read, Share = FileShare.Read, Options = FileOptions.Asynchronous | FileOptions.SequentialScan, BufferSize = 128 * 1024 }; var csvConfiguration = new CsvConfiguration(CultureInfo.InvariantCulture) { HasHeaderRecord = true, TrimOptions = TrimOptions.Trim, DetectColumnCountChanges = true }; await using var fileStream = new FileStream(filePath, fileOptions); using var textReader = new StreamReader(fileStream); using var csv = new CsvReader(textReader, csvConfiguration); await using var connection = new SqlConnection(_connectionString); await connection.OpenAsync(cancellationToken); await foreach ( CustomerImportRow row in csv.GetRecordsAsync<CustomerImportRow>() .WithCancellation(cancellationToken)) { rowsRead++; if (!TryValidate(row, out string validationError)) { rowsRejected++; // In production, write the row number, record identifier // and validationError to a rejected-record destination. continue; } batch.Add(row); if (batch.Count < BatchSize) { continue; } await WriteBatchAsync( connection, batch, cancellationToken); rowsImported += batch.Count; batch.Clear(); } if (batch.Count > 0) { await WriteBatchAsync( connection, batch, cancellationToken); rowsImported += batch.Count; batch.Clear(); } return new ImportResult( RowsRead: rowsRead, RowsImported: rowsImported, RowsRejected: rowsRejected); } private static bool TryValidate( CustomerImportRow row, out string validationError) { if (row.CustomerId <= 0) { validationError = "CustomerId must be greater than zero."; return false; } if (string.IsNullOrWhiteSpace(row.Name)) { validationError = "Name is required."; return false; } if (string.IsNullOrWhiteSpace(row.Email) || !row.Email.Contains('@')) { validationError = "Email is invalid."; return false; } validationError = string.Empty; return true; } private static async Task WriteBatchAsync( SqlConnection connection, IReadOnlyCollection<CustomerImportRow> batch, CancellationToken cancellationToken) { using DataTable table = BuildDataTable(batch); using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.CustomerImportStaging", BatchSize = batch.Count, BulkCopyTimeout = 300 }; bulkCopy.ColumnMappings.Add( "CustomerId", "CustomerId"); bulkCopy.ColumnMappings.Add( "Name", "Name"); bulkCopy.ColumnMappings.Add( "Email", "Email"); bulkCopy.ColumnMappings.Add( "Balance", "Balance"); await bulkCopy.WriteToServerAsync( table, cancellationToken); } private static DataTable BuildDataTable( IEnumerable<CustomerImportRow> rows) { var table = new DataTable(); table.Columns.Add( "CustomerId", typeof(long)); table.Columns.Add( "Name", typeof(string)); table.Columns.Add( "Email", typeof(string)); table.Columns.Add( "Balance", typeof(decimal)); foreach (CustomerImportRow row in rows) { table.Rows.Add( row.CustomerId, row.Name, row.Email, row.Balance); } return table; } }

This example deliberately creates a DataTable for only one batch—not for the entire file. After the bulk write completes, the batch and its temporary table can be released.

For extremely high-throughput systems, a custom IDataReader can stream records directly into SqlBulkCopy and avoid the intermediate DataTable. However, bounded batch tables are often simpler to implement and sufficiently efficient when their size is measured and controlled.

How to Choose a Batch Size

There is no universally correct batch size. A useful starting point might be 1,000 to 10,000 records, but the correct value depends on:

the average size of each record;
the number and type of columns;
validation and transformation complexity;
available application memory;
database transaction-log behavior;
network latency;
indexes and constraints on the destination table;
how quickly a failed batch must be retried.

Benchmark several batch sizes using representative production data. A larger batch can improve throughput, but it also increases memory usage, transaction size, retry cost, and the amount of work lost when a batch fails.

Step 3: Add Backpressure with a Bounded Channel

A streaming reader can still create memory pressure if it reads faster than downstream processing can complete.

Consider this situation:

Fast file reader ↓ Unlimited in-memory queue ↓ Slow API or database writer ↓ Memory continues growing

A bounded Channel<T> provides backpressure. When the channel reaches its configured capacity, the producer waits for consumers to create space.

using System.Threading.Channels; Channel<CustomerImportRow> channel = Channel.CreateBounded<CustomerImportRow>( new BoundedChannelOptions(capacity: 2_000) { FullMode = BoundedChannelFullMode.Wait, SingleWriter = true, SingleReader = false });

A more advanced pipeline can use:

CSV reader ↓ Bounded input channel ↓ Validation and transformation workers ↓ Bounded output channel ↓ Single batch writer ↓ Database

The database writer is usually kept controlled because unrestricted concurrent bulk writes can create contention, lock pressure, connection pressure, and unpredictable transaction-log growth.

Should the CSV File Be Read in Parallel?

Reading different portions of a CSV file in parallel is more complicated than splitting a byte range into equal pieces.

A partition boundary could occur:

in the middle of a UTF-8 character;
in the middle of a quoted field;
inside a record containing embedded line breaks;
between an opening and closing quotation mark.

For most systems, a safer design is:

one sequential parser per file;
bounded parallelism for independent record transformations;
controlled batching for the destination;
multiple files processed concurrently with a global concurrency limit.

Partitioned reading is reasonable only when the format guarantees safe boundaries or when a preprocessing step creates those boundaries.

Error Handling and Rejected Records

A production import should distinguish between several kinds of failure:

Failure type	Example	Recommended action
Business validation	Missing customer identifier	Reject the record and continue
Type conversion	Invalid decimal or date	Capture row details and continue when safe
CSV structure	Broken quotation or inconsistent columns	Quarantine the file or apply an approved recovery rule
Transient destination failure	Temporary database or network failure	Retry the current batch with limits
Permanent system failure	Schema mismatch or missing destination	Stop the import and mark it failed

Rejected records should be written to a separate file, table, or object-storage location as they are encountered. Avoid storing millions of error objects in a list until the import finishes.

A useful rejected-record entry can contain:

import identifier;
record or row number;
source record identifier;
validation or conversion message;
sanitized source data;
timestamp.

Make the Import Restartable

If a process fails after reading 18 GB of a 20 GB file, restarting from the beginning may be expensive and operationally frustrating.

Store import metadata such as:

ImportId SourceFileName SourceFileHash SourceFileSize StartedAt LastUpdatedAt RowsRead RowsImported RowsRejected LastCommittedBatch CurrentStatus FailureReason

Checkpoint only after a batch has been committed successfully. This prevents the checkpoint from advancing beyond durable data.

Byte-offset recovery is possible for some formats, but it must be designed carefully. A byte offset can point into the middle of a quoted CSV record. Batch numbers, source record keys, staging tables, or precomputed safe boundaries may be easier to reason about.

Use Idempotency to Prevent Duplicate Imports

A restartable process must also be safe to execute more than once.

Common approaches include:

assigning every import a unique identifier;
loading data into a staging table first;
creating unique constraints on business keys;
using a file hash to identify repeated uploads;
merging staged records into the final tables;
tracking which batches were committed;
making downstream operations idempotent.

A staging-table design is often safer than inserting directly into business tables because it separates ingestion, validation, reconciliation, and final merge operations.

Do Not Process a 20 GB Upload Inside One HTTP Request

An ASP.NET Core endpoint should generally accept or coordinate the upload, but it should not keep a request open while a multi-gigabyte import is parsed and written to a database.

A more resilient cloud architecture is:

Client upload ↓ Object storage ↓ Import metadata record ↓ Queue or job request ↓ Background .NET worker ↓ Streaming parser and batch writer ↓ Progress and completion status ↓ Client polls or receives notification

This separates upload availability from processing time and makes retries, cancellation, progress reporting, and horizontal scaling easier to manage.

Design the Processor Around Stream Instead of File Paths

A reusable ingestion service should ideally accept a Stream rather than requiring a local file path.

public Task<ImportResult> ImportAsync( Stream source, CancellationToken cancellationToken = default) { // Build the parser around the supplied stream. }

The caller can then supply:

a local FileStream;
an Azure Blob Storage stream;
an Amazon S3 response stream;
a decompression stream;
a secure network stream;
a test stream created from known sample data.

This design avoids unnecessary temporary copies and makes the processing logic easier to test.

Applying the Same Pattern to Other File Types

Format	Memory-efficient .NET approach	Important consideration
CSV or TSV	CsvHelper with a stream	Quoted fields can span lines
NDJSON	Read one line and deserialize one object	Each line must be an independent JSON value
Large JSON array	`DeserializeAsyncEnumerable<T>` or `Utf8JsonReader`	Avoid loading the complete JSON document model
XML	`XmlReader`	Use forward-only processing instead of loading a complete document
Fixed-width text	`StreamReader` with span-based field parsing	Validate line length before slicing
Application logs	Line-by-line streaming	Multiline exceptions may need grouping logic
Binary files	`ReadAsync` with a reusable byte buffer	Chunks may not align with logical records
GZip or ZIP	Wrap the source in a decompression stream	Do not extract the entire archive when streaming is possible
PDF or Office documents	Use format-specific extraction APIs	Some libraries build a complete in-memory document model

Streaming a Large JSON Array

await foreach ( EventRecord? item in JsonSerializer.DeserializeAsyncEnumerable<EventRecord>( sourceStream, cancellationToken: cancellationToken)) { if (item is null) { continue; } await ProcessEventAsync(item, cancellationToken); }

Streaming XML with XmlReader

var settings = new XmlReaderSettings { Async = true, DtdProcessing = DtdProcessing.Prohibit }; using XmlReader reader = XmlReader.Create(sourceStream, settings); while (await reader.ReadAsync()) { cancellationToken.ThrowIfCancellationRequested(); if (reader.NodeType == XmlNodeType.Element && reader.Name == "customer") { // Read and process the current customer element. } }

.NET Use Cases for Large-File Streaming

This architecture is useful for:

customer and account migrations;
financial transaction imports;
insurance or healthcare data exchanges;
product catalog and inventory imports;
application-log analysis;
telemetry ingestion;
data cleaning and normalization;
legacy-system modernization;
ETL and reporting pipelines;
document preparation for AI applications;
RAG ingestion and embedding workflows;
scheduled cloud data-processing jobs.

How Large-File Processing Connects to AI and RAG

Many AI systems begin with a data-ingestion problem rather than a model problem.

Large source files ↓ Streaming ingestion ↓ Validation and data cleaning ↓ Text or field extraction ↓ Chunking ↓ Embedding generation ↓ Vector index ↓ RAG application

Loading the complete dataset before generating embeddings can create unnecessary memory pressure and reduce recoverability. A bounded pipeline can read, clean, chunk, and submit a controlled number of items while respecting downstream API limits.

Costs and Scaling Considerations

Streaming reduces memory requirements, but the overall design still affects cloud and infrastructure costs.

Cost area	What drives it	How to control it
Memory	Batch size, channel capacity, record size and worker count	Keep every boundary bounded and measure peak working set
Compute	Parsing, validation, decompression and transformation	Scale workers only when CPU is the actual bottleneck
Storage	Source files, staging data, rejected records and retained copies	Define retention and lifecycle policies
Network	Moving large files between regions or services	Process close to the storage and destination systems
Database	Indexes, logging, locking, concurrency and batch size	Use staging tables, bulk operations and controlled concurrency
Failure recovery	Repeatedly processing an entire file	Use checkpoints, idempotency and bounded retries

Scaling should be based on measurements. Adding workers will not help when the database, storage service, network, or third-party API is already saturated.

Observability for Large Imports

At minimum, capture:

bytes processed;
records read;
records imported;
records rejected;
records processed per second;
bytes processed per second;
current batch duration;
database-write duration;
channel backlog;
retry count;
process working set;
garbage-collection activity;
estimated completion percentage;
total elapsed time.

Progress should be reported periodically rather than after every record. Excessive logging can become its own performance bottleneck.

Decision Framework

Situation	Recommended design
Small internal file with controlled format	Simple streaming reader and validation
Real CSV with quoting and escaped values	CsvHelper or another standards-aware CSV parser
Large CSV imported into SQL Server	Streaming parser, bounded batches, staging table and SqlBulkCopy
CPU-intensive transformation	Bounded Channel with a measured number of workers
Several large files	One sequential parser per file with a global concurrency limit
Large cloud upload	Object storage, queue or job request, and a background worker
Import must resume after failure	Committed batch checkpoints, staging and idempotency
JSON or XML instead of CSV	Forward-only format-specific streaming reader

Final Recommendations

A 20 GB file does not require a 20 GB in-memory collection. A reliable .NET solution should:

open the source as a stream;
read sequentially;
use a parser appropriate for the format;
avoid ToList, ToArray, and full-file document models;
validate and transform records incrementally;
keep queues, batches, and worker counts bounded;
use bulk destination operations;
write rejected records progressively;
support cancellation and restart;
make repeated execution idempotent;
measure memory, throughput, and downstream latency.

The most important architectural rule is not a particular buffer size or library. It is maintaining a clear upper boundary on how much data every stage is allowed to retain.

Frequently Asked Questions

Can .NET process a 20 GB CSV file with limited memory?

Yes. The application can process a 20 GB CSV file with limited memory when it reads the source as a stream and retains only the current record, a controlled batch, and any bounded pipeline buffers. Memory usage depends on those boundaries rather than the complete file size.

Does async file reading reduce memory usage?

Not automatically. Asynchronous I/O reduces thread blocking while waiting for I/O. A method such as ReadAllLinesAsync is asynchronous but still loads every line into memory. Streaming and asynchrony solve different problems.

What buffer size should I use for a large file?

Start with a moderate buffer such as 64 KB or 128 KB and benchmark with representative files and storage. Very large buffers do not automatically produce better throughput and increase memory usage when many files are processed concurrently.

What is the best database batch size?

There is no universal best size. Values between 1,000 and 10,000 records are reasonable starting points, but record width, database indexes, transaction logging, latency, retry requirements, and available memory should determine the final value.

Can I parse a CSV file with string.Split?

Only when the format is strictly controlled and fields can never contain delimiters, quotation marks, or line breaks. For general CSV files, use a parser that supports CSV quoting and escaping rules.

Does CsvHelper load the complete file into memory?

CsvHelper can enumerate records progressively when its record sequence is consumed directly. Calling ToList, ToArray, or another materializing operation will load all resulting records into a collection.

Should I read one large CSV file using multiple threads?

Usually not. CSV partition boundaries can fall inside quoted or multiline fields. A safer design uses one sequential parser and parallelizes independent downstream transformations with bounded concurrency.

How should a large CSV upload be handled in ASP.NET Core?

Store the uploaded file in durable storage, create an import job, and process it with a background worker. The client can query job status or receive a completion notification instead of keeping one HTTP request open during the entire import.

How can large JSON and XML files be processed with low memory?

For JSON, use an asynchronous enumerable or a forward-only UTF-8 reader. For XML, use XmlReader. Avoid loading a complete JSON or XML document model when only progressive processing is required.

What happens if one record is extremely large?

Streaming limits memory relative to the amount of data currently being processed, but one exceptionally large record can still create a memory spike. Apply maximum record and field sizes when the business format allows them.

How to Process a Large CSV File in .NET (10 GB, 20 GB, or More)